“Laughing at you or with you”: The Role of Sarcasm in Shaping the Disagreement Space

Detecting arguments in online interactions is useful for understanding how conflicts arise and get resolved. Users often employ figurative language, such as sarcasm, either as a persuasive device or to attack the opponent via an ad hominem argument. To further our understanding of the role of sarcasm in shaping the disagreement space, we present a thorough experimental setup using a corpus annotated with both argumentative moves (agree/disagree) and sarcasm. We exploit joint modeling in terms of (a) applying discrete features that are useful in detecting sarcasm to the task of argumentative relation classification (agree/disagree/none), and (b) multitask learning for argumentative relation classification and sarcasm detection using deep learning architectures (e.g., a dual Long Short-Term Memory (LSTM) network with hierarchical attention and Transformer-based architectures). We demonstrate that modeling sarcasm improves the argumentative relation classification task (agree/disagree/none) in all setups.


Introduction
User-generated conversational data such as discussion forums provide a wealth of naturally occurring arguments. The ability to automatically detect and classify argumentative relations (e.g., agree/disagree) in threaded discussions is useful to understand how collective opinions form and how conflict arises and is resolved (van Eemeren et al., 1993; Abbott et al., 2011; Walker et al., 2012b; Misra and Walker, 2013; Ghosh et al., 2014; Rosenthal and McKeown, 2015; Stede and Schneider, 2018). Linguistic and argumentation theories have thoroughly studied the use of sarcasm in argumentation, including its effectiveness as a persuasive device or as a means to express an ad hominem fallacy (attacking the opponent instead of her/his argument) (Tindale and Gough, 1987; van Eemeren and Grootendorst, 1992; Gibbs and Izett, 2005; Averbeck, 2013). We propose an experimental setup to further our understanding of the role of sarcasm in shaping the disagreement space in online interactions. The disagreement space, defined in the context of the dialogical perspective on argumentation, is seen as the set of speech acts initiating the difference of opinion that argumentation is intended to resolve (Jackson, 1992; van Eemeren et al., 1993). Our study is based on the Internet Argument Corpus (IAC) introduced by Abbott et al. (2011), which contains online discussions annotated for the presence/absence and the type of an argumentative move (agree/disagree/none) as well as the presence/absence of sarcasm. Consider the dialogue turns from the IAC in Table 1, where the current turn (henceforth, ct) is a sarcastic response to the prior turn (henceforth, pt). These dialogue moves can be argumentative (agree/disagree) or not argumentative (none).

Table 1: Turn pairs from the IAC and their argumentative relations (Arg. Rel.).

Agree
Prior Turn: Today, no informed creationist would deny natural selection.
Current Turn: Seeing how this was proposed over a century and a half ago by Darwin, what took the creationists so long to catch up?

Disagree
Prior Turn: Personally I wouldn't own a gun for self defense because I am just not that big of a sissy.
Current Turn: Because taking responsibility for one's own safety is certainly a sissy thing to do?

Disagree
Prior Turn: I'm not surprised that no one on your side of the debate would correct you, but wolves and dogs are both members of the same species. The Canid species.
Current Turn: Wow, you're even wrong when you get away from your precious Bible and try to sound scientific.

None
Prior Turn: The hand of God kept me from serious harm. Maybe He has a plan for me.
Current Turn: You better hurry up. Aren't you like 113 years old?
The argumentative move can express agreement (first example) or disagreement (the second example is an undercutter, while the third example is an ad hominem attack). The fourth example, although sarcastic, is not argumentative. Note that none of the current turns contain explicit lexical terms that could signal an argumentative relation with the prior turn. Instead, the argumentative move is expressed implicitly through sarcasm.
We study whether modeling sarcasm can improve the detection and classification of argumentative relations in online discussions. We propose a thorough experimental setup to answer this question using feature-based machine learning approaches and deep learning models. For the former, we show that combining features that are useful for detecting sarcasm (Joshi et al., 2015; Muresan et al., 2016; Ghosh and Muresan, 2018) with state-of-the-art argument features leads to better performance on the argumentative relation classification task (agree/disagree/none) (Section 5). For the deep learning approaches, we hypothesize that multitask learning, which allows representations to be shared between multiple tasks (here, argumentative relation classification and sarcasm detection), leads to better generalization. We investigate the impact of multitask learning for a dual Long Short-Term Memory (LSTM) network with hierarchical attention (Ghosh et al., 2017) (Section 4.2) and BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), including an optional joint multitask learning objective with uncertainty-based weighting of task-specific losses (Kendall et al., 2018) (Section 4.3). We demonstrate that multitask learning improves the performance of the argumentative relation classification task in all settings (Section 5). We provide a detailed qualitative analysis (Section 5.1) to give insights into when and how modeling sarcasm helps. We make the code from our experiments publicly available at https://github.com/ritvikshrivastava/multitask_transformers. The Internet Argument Corpus (IAC) (Walker et al., 2012b) is publicly available at https://nlds.soe.ucsc.edu/iac2.

Related Work

Argument mining is a growing area of research in computational linguistics, focusing on the detection of argumentative structures in a text (see Stede and Schneider (2018) for an overview).
This paper focuses on two subtasks: argumentative relation identification and classification (i.e., agree/disagree/none). Earlier work on these subtasks relied on feature-based machine learning models, focusing on online discussions (Abbott et al., 2011; Walker et al., 2012b; Misra and Walker, 2013; Ghosh et al., 2014; Wacholder et al., 2014) and monologues (Stab and Gurevych, 2014, 2017; Persing and Ng, 2016; Ghosh et al., 2016). Stab and Gurevych (2014) proposed a set of lexical, syntactic, semantic, and discourse features to classify argument components and relations in persuasive essays. On the same essay dataset, Nguyen and Litman (2016) utilized contextual information to improve accuracy. Both Stab and Gurevych (2017) and Persing and Ng (2016) used Integer Linear Programming (ILP) based joint modeling to detect argument components and relations. Rosenthal and McKeown (2015) introduced sentence similarity and accommodation features, whereas Menini and Tonelli (2016) showed how entailment between text pairs can discover argumentative relations. The argumentative features in our feature-based model are based on the above works (Section 4.1). We show that additional features that are useful for sarcasm detection (Joshi et al., 2015; Ghosh and Muresan, 2018) enhance the performance on the argumentative relation identification and classification tasks.
In addition to feature-based models, deep learning models have recently been used for these tasks; for example, Potash et al. (2017) proposed a pointer network, and Hou and Jochim (2017) also addressed argumentative relations. Finally, analyzing the role of sarcasm and verbal irony in argumentation has a long history in linguistics (Tindale and Gough, 1987; Gibbs and Izett, 2005; Averbeck, 2013; van Eemeren and Grootendorst, 1992). We propose joint modeling of argumentative relation detection and sarcasm detection to empirically validate sarcasm's role in shaping the disagreement space in online conversations.
While the focus of our paper is not to provide a state-of-the-art sarcasm detection model, our feature-based models, along with the deep learning models for sarcasm detection, are based on state-of-the-art approaches. We implemented discrete features such as pragmatic features (González-Ibáñez et al., 2011; Muresan et al., 2016), diverse sarcasm markers (Ghosh and Muresan, 2018), and incongruity detection features (Riloff et al., 2013; Joshi et al., 2015). The LSTM models are influenced by Ghosh and Veale (2017) and Ghosh et al. (2018), who use contextual knowledge to detect sarcasm. Lastly, transformer models such as BERT and RoBERTa were used in the winning entries of a recent shared task on sarcasm detection (Ghosh et al., 2020). In our experiments, for both kinds of deep learning models, the best results are obtained with the multitask setup, showing that multitask learning indeed helps improve both tasks.

Data
Our training and test data are collected from the Internet Argument Corpus (IAC) (Walker et al., 2012a). This corpus consists of posts from conversations in online forums on a range of controversial political and social topics such as Evolution, Abortion, Gun Control, and Gay Marriage (Abbott et al., 2011, 2016). Multiple versions of the IAC are publicly available; we use a particular subset, denoted IAC_orig, collected from Abbott et al. (2011). It consists of around 10K pairs of conversation turns (i.e., prior turn pt and current turn ct) that were annotated using Mechanical Turk for argumentative relations (agree/disagree/none) and other characteristics such as sarcasm/non-sarcasm, respect/insult, and nice/nastiness. The median Cohen's κ is 0.5 across all topics.
Each current turn in a <pt, ct> pair is also labeled with a Sarcasm (S) or Non-Sarcasm (NS) label. Table 2 shows the data statistics in terms of argumentative relations (A/D/N) and sarcasm (S/NS). We split the dataset into training (80%; 7,982 turn pairs), test (10%; 999 turn pairs), and dev (10%; 999 turn pairs) sets, where each set contains a proportional number of instances (e.g., 80% of the 315 sarcastic turns (S) with argument relation label A (agree), i.e., 252 turns, appear in the training set). The dev set is used for parameter tuning.
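The proportional split described above can be sketched with scikit-learn's stratified splitting. This is a minimal illustration with toy data and hypothetical labels; the paper does not publish its split script, and the joint (argument, sarcasm) stratification key used here is an assumption.

```python
from sklearn.model_selection import train_test_split

# Toy <pt, ct> pairs with joint (argument, sarcasm) labels; 10% are (A, S).
pairs = [f"pair{i}" for i in range(100)]
labels = [("A", "S") if i % 10 == 0 else ("D", "NS") for i in range(100)]
strata = [a + "_" + s for a, s in labels]  # joint stratification key

# 80/20 split first, then split the 20% evenly into dev and test,
# preserving the label proportions at every step.
train, rest, y_train, y_rest = train_test_split(
    pairs, strata, test_size=0.2, stratify=strata, random_state=0)
dev, test, y_dev, y_test = train_test_split(
    rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

With this scheme, 80% of the sarcastic agree instances land in the training set, mirroring the 252-of-315 proportion reported above.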

Experimental Setup
We present the computational approaches used to investigate whether modeling sarcasm can help detect argumentative relations. As our goal is to provide a comprehensive empirical investigation of sarcasm's role in argument mining rather than to propose new models, we explore three machine learning approaches that are well established for studying argumentation and figurative language. First, we implement a Logistic Regression model that exploits a combination of state-of-the-art features to detect argumentative relations as well as sarcasm (Section 4.1). Second, we present a dual LSTM architecture with hierarchical attention and its multitask learning setup (Section 4.2). Third, we discuss experiments using pre-trained BERT models and the multitask learning architectures we build on them (Section 4.3).

Logistic Regression with Discrete Features
We use a Logistic Regression (LR) model with both argument-relevant (ArgF) and sarcasm-relevant (SarcF) features. Unless mentioned otherwise, all features are extracted from the current turn ct.
Argument-relevant features (ArgF). We first evaluate features reported as useful for identifying and classifying argumentative relations: (a) n-grams (unigrams, bigrams, trigrams) created from the full vocabulary of the IAC; (b) argument lexicons: two lists of twenty words representing agreement (e.g., "agree", "accord") and disagreement (e.g., "differ", "oppose"), respectively (Rosenthal and McKeown, 2015); (c) sentiment lexicons such as MPQA (Wilson et al., 2005) and the opinion lexicon of Hu and Liu (2004) to identify sentiment in the turns; (d) hedge features, since hedges are often used to mitigate the speaker's commitment (Tan et al., 2016); (e) PDTB discourse markers, because claims often start with discourse markers such as "therefore" and "so" (we discard markers of the temporal relation); (f) modal verbs, because they signal the degree of certainty when expressing a claim (Stab and Gurevych, 2014); (g) pronouns, since they dialogically point to the previous speaker's stance; (h) textual entailment, capturing whether a position expressed in the prior turn is accepted in the current turn (Cabrio and Villata, 2012; Menini and Tonelli, 2016); (i) lemma overlap of nouns, verbs, and adjectives between the turns, to determine topical alignment between the prior and current turn (Somasundaran and Wiebe, 2010); and (j) negation, to extract explicit negation cues (e.g., "not", "don't") that often signal disagreement.
Sarcasm-relevant features (SarcF). As sarcasm-relevant features we use: (a) Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001) features to capture linguistic, social, individual, and psychological processes; (b) sentiment incongruity features, capturing the number of times the sentiment polarity differs between the prior turn pt and the current turn ct, as well as the number of positive and negative sentiment words in the turns (Joshi et al., 2015); and (c) the sarcasm markers used by Ghosh and Muresan (2018): capitalization, quotation marks, punctuation, exclamations that emphasize a sense of surprise, tag questions, interjections (because they seem to undermine a literal evaluation), hyperbole (because users frequently overstate the magnitude of an event in sarcasm), and emoticons and emojis, since they often emphasize the sarcastic intent.
We use SKLL, an open-source Python package that wraps the Scikit-learn toolkit (Pedregosa et al., 2011), and perform the feature-based experiments using the Logistic Regression model from Scikit-learn.
In the experimental runs, LR_ArgF (the model that uses only the ArgF features) denotes the individual model and LR_ArgF+SarcF (the model that uses both ArgF and SarcF features) denotes the joint model.
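The joint LR_ArgF+SarcF setup can be sketched as follows. This is a simplified stand-in, not the paper's implementation: word n-grams stand in for the ArgF features, and three hand-crafted counts (repeated punctuation, interjections, emphatic capitalization) stand in for the much richer SarcF feature set; the toy turns and labels are invented for illustration.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def sarcasm_marker_features(turns):
    """Toy SarcF features: repeated punctuation, interjections, all-caps words."""
    feats = []
    for t in turns:
        feats.append([
            len(re.findall(r"[!?]{2,}", t)),                  # multiple punctuation
            len(re.findall(r"\b(wow|oh|huh)\b", t.lower())),  # interjections
            len(re.findall(r"\b[A-Z]{2,}\b", t)),             # emphatic capitalization
        ])
    return np.array(feats)

train_turns = ["Wow, NICE argument!!", "I agree with your point.",
               "Oh sure, that totally makes sense??", "We differ on this issue."]
labels = ["D", "A", "D", "D"]  # invented A/D labels for the sketch

vec = CountVectorizer(ngram_range=(1, 2))              # ArgF stand-in: uni/bigrams
X = np.hstack([vec.fit_transform(train_turns).toarray(),
               sarcasm_marker_features(train_turns)])  # ArgF + SarcF
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Dropping the `sarcasm_marker_features` columns recovers the LR_ArgF individual model.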

Dual LSTM and Multitask Learning
LSTMs are able to learn long-term dependencies (Hochreiter and Schmidhuber, 1997) and have been shown to be effective in Natural Language Inference (NLI) research, where the task is to establish the relationship between multiple inputs (Rocktäschel et al., 2015). This type of architecture is often called a dual architecture, since one LSTM models the premise and the other the hypothesis (in Recognizing Textual Entailment (RTE) tasks). Ghosh et al. (2018) used the dual LSTM architecture with hierarchical attention (HAN) (Yang et al., 2016) for sarcasm detection to model the conversation context, and we use their approach in this paper to model the current turn ct and the prior turn pt. HAN implements attention at both the word level and the sentence level. The distinctive characteristic of this attention is that the word/sentence representations are weighted by measuring their similarity with a word/sentence-level context vector, respectively, which is randomly initialized and jointly learned during training (Yang et al., 2016). We compute vector representations for the current turn ct and the prior turn pt and concatenate the vectors from the two LSTMs for the final softmax decision (i.e., A, D, or N for argumentative relation detection). Henceforth, this dual LSTM architecture is denoted LSTM_attn.
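The attention-pooling step of the hierarchical attention just described can be sketched as follows. This is a simplified version of the Yang et al. (2016) mechanism (it omits the learned projection layer before the similarity with the context vector) with random stand-in hidden states rather than real LSTM outputs.

```python
import numpy as np

def attention_pool(H, u):
    """H: (timesteps, dim) hidden states; u: (dim,) learned context vector.
    Returns the attention weights and the weighted summary vector."""
    scores = np.tanh(H) @ u                  # similarity with the context vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over timesteps
    return weights, weights @ H              # convex combination of hidden states

rng = np.random.default_rng(0)
H_ct = rng.standard_normal((7, 4))           # stand-in hidden states for ct
u = rng.standard_normal(4)                   # randomly initialized context vector
w, v_ct = attention_pool(H_ct, u)
# v_pt is computed the same way from LSTM_pt; [v_pt; v_ct] feeds the softmax layer.
```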
To measure the impact of sarcasm on argumentative relation detection, we use a multitask learning approach. Multitask learning aims to leverage useful information in multiple related tasks to improve the performance of each task (Caruana, 1997; Liu et al., 2019). We use a simple hard parameter sharing network. The architecture is a replica of LSTM_attn, modified to employ two loss functions: one for sarcasm detection (i.e., training with the S and NS labels) and one for the argumentative relation classification task (i.e., training with the A, D, and N labels). Figure 1 shows the high-level architecture of the dual LSTM and multitask learning model (LSTM_MT). The prior turn pt (left) and the current turn ct (right) are read by two separate LSTMs (LSTM_pt and LSTM_ct). In the case of LSTM_MT, the concatenation of v_pt and v_ct is passed through a dense+softmax layer for each task, as shown in Figure 1. As with the LR models, LSTM_attn represents the individual model (i.e., it predicts only the argumentative relation) whereas LSTM_MT represents the joint model.
Dynamic Multitask Loss. In addition to simply adding the two losses, we also employ dynamic weighting of the task-specific losses during training, based on the homoscedastic uncertainty of the tasks, as proposed by Kendall et al. (2018):

L_total = Σ_t [ (1 / (2σ_t²)) · L_t + log σ_t ]

where L_t and σ_t denote the task-specific loss and its variance, respectively, over the training instances. We denote this variation as LSTM_MT-uncert.
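In practice the Kendall et al. (2018) weighting is usually implemented with free log-variance parameters s_t = log σ_t², which keeps the loss numerically stable; a minimal sketch under that standard reparameterization (the per-task losses below are made-up numbers):

```python
import math

def multitask_loss(task_losses, log_vars):
    """Uncertainty-weighted sum: total = sum_t( 0.5 * exp(-s_t) * L_t + 0.5 * s_t ),
    where s_t = log(sigma_t^2) is a learnable parameter per task."""
    return sum(0.5 * math.exp(-s) * L + 0.5 * s
               for L, s in zip(task_losses, log_vars))

# e.g., argumentative-relation loss and sarcasm loss with learned log-variances
total = multitask_loss([1.2, 0.7], [0.0, -0.5])
```

A task whose learned s_t grows (higher uncertainty) contributes less to the total loss, while the 0.5 * s_t term prevents the trivial solution of inflating all variances.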

Pretrained BERT and Multitask Learning

BERT (Devlin et al., 2019) has achieved state-of-the-art performance on many NLP tasks. BERT is initially trained on masked token prediction and next sentence prediction over large corpora (English Wikipedia and the Book Corpus). During training, a special token "[CLS]" is added to the beginning of each training instance, and "[SEP]" tokens are added to indicate the end of an utterance and to separate utterances when there are two (e.g., pt and ct). During evaluation, the learned representation of the "[CLS]" token is processed by an additional layer with a nonlinear activation. In its standard form, pre-trained BERT ("bert-base-uncased") can be used for transfer learning by fine-tuning on a downstream task, here argumentative relation detection, where training instances are labeled A, D, and N. We denote as BERT_orig the baseline model fine-tuned only on the training partition of the argumentative relation data (i.e., individual task training). Unless mentioned otherwise, we use the BERT predictions available via the "[CLS]" token. We propose two variations on this model in the multitask learning setting, briefly described below.
Multitask Learning with BERT. The first model we use for multitask learning is denoted BERT_MT (BERT Multitask Learning). Here, we pass the BERT output embeddings to two classification heads, one per task (argumentative relation classification and sarcasm detection), each receiving the relevant gold labels. Each classification head is a linear layer (of size 3 and 2, respectively, matching the number of argumentative relation and sarcasm labels) applied on top of the pooled BERT output. The losses from the individual heads are added and propagated back through the model, allowing BERT to model the nuances of both tasks and their interdependence simultaneously. Dynamic Loss: As in the LSTM architecture, we also experiment here with the dynamic multitask loss; we denote this variation BERT_MT-uncert.
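The two-head arrangement can be sketched shape-wise as follows. This is not real BERT: a random 768-dimensional vector stands in for the pooled "[CLS]" output, the head weights are random, and the gold labels are invented; the point is only how one shared representation feeds two linear heads whose cross-entropy losses are summed.

```python
import numpy as np

rng = np.random.default_rng(1)
pooled = rng.standard_normal(768)  # stand-in for BERT's pooled [CLS] output

# Two task-specific linear heads over the shared representation.
W_arg, b_arg = rng.standard_normal((3, 768)) * 0.01, np.zeros(3)    # A/D/N head
W_sarc, b_sarc = rng.standard_normal((2, 768)) * 0.01, np.zeros(2)  # S/NS head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, gold):
    return -np.log(probs[gold])

p_arg = softmax(W_arg @ pooled + b_arg)
p_sarc = softmax(W_sarc @ pooled + b_sarc)
# Summed losses for invented gold labels (D for arguments, S for sarcasm);
# the gradient of this sum updates both heads and the shared encoder.
loss = cross_entropy(p_arg, gold=1) + cross_entropy(p_sarc, gold=0)
```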
Alternate Multitask Learning. We also employ a multitask learning technique in which we enrich the learning by fine-tuning on additional labeled material for the sarcasm detection task. Specifically, we exploit "sarcasm V2", a sarcasm detection dataset also curated from the original IAC and released by Oraby et al. (2016). We pre-process "sarcasm V2" by removing duplicates that appear in IAC_orig, ending up with 3,513 training and 423 dev instances balanced between the S/NS categories, which we merge into the sarcasm training and dev data from IAC_orig, respectively. Note that, unlike in the original multitask setting, we now have more sarcasm-labeled instances (11,495 in total) than instances labeled with argumentative relations (7,982, as before) for training, while the test set from IAC_orig is kept unchanged.
Since the training data is now unequal between the two tasks, we create mini-batches such that each batch consists of instances with labels for only one task (either argumentative relations or sarcasm). The batches from the two tasks are interleaved uniformly: in each iteration, the BERT output is passed to only one of the two task-specific classification heads, and the corresponding loss is used to update the parameters. In this way, the model trains on both tasks but alternates between them per mini-batch, while the surplus batches of sarcasm data from "sarcasm V2" are processed together at the end. This model is denoted BERT_ALT (see Figure 2).
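The alternating schedule just described can be sketched as follows. The exact scheduling code is not given in the paper; this sketch assumes uniform interleaving with the surplus sarcasm batches appended at the end, and the batch names are placeholders.

```python
from itertools import zip_longest

def interleave(arg_batches, sarc_batches):
    """Alternate batches from the two tasks; leftover batches of the larger
    task (here sarcasm) run consecutively at the end."""
    schedule = []
    for a, s in zip_longest(arg_batches, sarc_batches):
        if a is not None:
            schedule.append(("arg", a))   # update only the argument head
        if s is not None:
            schedule.append(("sarc", s))  # update only the sarcasm head
    return schedule

plan = interleave(["a1", "a2"], ["s1", "s2", "s3", "s4"])
```

Each entry tells the training loop which classification head receives the batch, so only one task's loss updates the shared encoder per iteration.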
For brevity, the parameter tuning details for all models (Logistic Regression, dual LSTM, BERT) are given in the supplemental material.

Results and Discussion

Table 3 presents the classification results on the test set. We report F1 scores for each class (A, D, and N) and the overall Micro-F1 score (F1_micro), used to account for the multi-class setting and class imbalance. The LR model using both the SarcF and ArgF features performs better than the model using the ArgF features alone, improving the overall performance by an absolute 2.9% F1_micro and showing a large impact on the agreement class A (an 8.6% absolute improvement).

Table 4 shows the top discrete features for argumentative relation identification.

Table 4: Top discrete features from the LR_ArgF and LR_ArgF+SarcF models, respectively. A and D depict the argumentative relations (agree and disagree) for the particular feature.

LR_ArgF: pronouns: I, my (both A), your(s) (D); discourse: so, because, for (all A), incidentally, particularly, although (all D); disagree lexicon: disagree, differ (both D); agree lexicon: agreed (A); entailment relation; negation (D)

LR_ArgF+SarcF: pronouns: mine, my (both A), you (D); discourse: then (A), though, however (both D); modal: will (A); punctuation: multiple question marks (both A and D); tag question: "are you", "do you" (both D); hyperbole: wonderful (A), nonsense, biased (both D); LIWC dimensions: anxiety, assent, certainty (all D); sentiment incongruity (D); interjections: so, agreed (both A)

Among the ArgF features (first column), we notice discourse expansion ("particularly") and contrast ("although") markers and the agree/disagree lexicons receiving high feature weights. Pronouns also receive large weights because argumentative text often refers to a personal stance (e.g., "you think", "I believe"). When analyzing the ArgF+SarcF features, however, we find that sarcasm markers such as tag questions, hyperbole, and multiple punctuation, as well as sarcasm characteristics such as sentiment incongruity, receive the highest weights.

For the LSTM models, we see that multitask learning helps: LSTM_MT-uncert shows a statistically significant 2.8% improvement over the single-task model LSTM_attn. Moreover, the improvement for the agree (A) and disagree (D) classes is 5.1%, with only a small reduction (0.7%) for the none (N) class.

For BERT, we again observe better results with multitask learning; the best performing model is BERT_MT-uncert, which uses the dynamic weighting of task-specific losses during training (Kendall et al., 2018). The performance increase is consistent across all three classes, and the differences in performance among the setups are statistically significant, as shown in Table 3. Moreover, the BERT_MT-uncert model improves F1_micro by a large margin compared to the LR and LSTM models. However, adding more data for the auxiliary task (i.e., sarcasm detection), as in BERT_ALT, did not provide a significant improvement: only a 0.2 gain in F1_micro over BERT_MT (although it does improve over the single-task model). The reason could be that although "sarcasm V2" is a subset of the original IAC corpus, it was annotated by a different set of Turkers than IAC_orig, with different annotation guidelines.
Among the three classes A, D, and N, we observe the lowest performance on the A class. This is unsurprising given the highly unbalanced training data (A occurs in less than 10% of the instances in IAC_orig; see Table 2).
In sum, these improvements of multitask learning over single-task argumentative relation detection indicate that modeling sarcasm is useful for modeling the disagreement space in online discussions. This provides empirical support for existing theories that study sarcasm's impact on argumentation, persuasion, and argument fallacies such as ad hominem attacks. Finally, we notice that multitask learning also improves performance on the sarcasm detection task (results are presented in the Appendix).

Qualitative Analysis
To further investigate the effect of multitask learning, we present qualitative analyses to:

1. Understand the models' performance by looking at the turns correctly classified by the multitask models and misclassified by the corresponding individual single-task models. We analyze the turns in terms of sarcastic characteristics: whether they exhibit incongruity, humor, or sarcasm indicators (i.e., markers).

2. Understand when both the multitask and the individual models made incorrect predictions.
To address the first issue, we compare the predictions of the multitask and individual models across settings. For example, BERT_MT-uncert correctly identifies 6 more A, 50 more D, and 60 more N instances than BERT_orig (out of 91, 398, and 510 instances, respectively). Two of the authors independently investigated a random sample of 100 instances (the qual set) chosen from the union of the test instances that are correctly predicted only by the multitask models (LR_ArgF+SarcF, LSTM_MT-uncert, BERT_MT-uncert, and BERT_ALT) and not by the corresponding individual models (LR_ArgF, LSTM_attn, and BERT_orig). For both the Transformer- and LSTM-based models, we explore how the attention heads behave and whether common patterns exist (e.g., attending to words with opposite meanings when incongruity occurs). We display heat maps of the attention weights for a pair of prior and current turns for the LSTM-based models (Figure 3), whereas for BERT we display word-to-word attentions (Figures 4, 5, 6, 7, and 8) using visualization tools (Vig, 2019; Yang and Zhang, 2018). All the examples presented in this section are argumentative moves (i.e., turns labeled A or D) correctly identified by our multitask learning models but wrongly predicted as none (N) by the individual models. Moreover, the multitask learning models also correctly predict that these turns are instances of sarcasm.
Incongruity between prior turn and current turn. Semantic incongruity, which can appear between the conversation context pt and the current turn ct, is an inherent characteristic of sarcasm (Joshi et al., 2015). It highlights the inconsistency between expectations and reality, making sarcasm or irony highly effective in persuasive communication (Gibbs and Izett, 2005). For BERT, Figure 4 presents the turns "evolution can't prove the book of genesis false" (pt) ↔ "ignorant of science think evolution has anything to do with the bible" (ct). Here, BERT_MT-uncert shows more attention between the incongruous terms ("genesis" ↔ "science", "evolution") as well as to the mocking word "ignorant". Likewise, Figure 6 presents the turns "you are quite anti religious it seems" (pt) ↔ "anti ignorance and superstition . . . this is religion" (ct). The word "religious" attends to "anti" and "ignorance" with high weights in BERT_MT-uncert (from pt to ct), whereas BERT_orig only attends to the word "religious" from pt to ct. By modeling sarcasm, the multitask learning models can better predict argumentative moves that are expressed implicitly.
We also evaluate the BERT_ALT model on the examples presented in Figures 4 and 6. Figure 5 shows that although BERT_ALT attends (from pt to ct) to the incongruous terms "genesis" ↔ "evolution", the strength of the relation (i.e., the attention weight) is comparatively lower than for BERT_MT-uncert (see Figure 4). Similarly, comparing Figures 6 and 7, the BERT_MT-uncert model attends to multiple words in ct from the word "religion" in pt, whereas the BERT_ALT model attends with high weights only to the two words "anti" and "ignorance" from "religion" (pt to ct).
Humor by word repetition. The current turn ct often sarcastically taunts the prior turn pt through word repetition and rhyme, producing a humorous comic effect, also regarded as the phonetic style of humor (Yang et al., 2015). For the pair "genetics has nothing to do with it" (pt) ↔ "are saying that genetics has nothing to do with genetics?" (ct), we notice that in BERT_MT-uncert the token "it" in pt correctly attends to both occurrences of "genetics" in ct, where the second occurrence is the co-referent of "it" (Figure 8); this is missed by the individual model BERT_orig.
Role of sarcasm markers. Sarcasm markers are indicators that alert the reader that an utterance is sarcastic (Attardo, 2000). Comparing the LR_ArgF+SarcF and LR_ArgF models, we observe that markers such as multiple punctuation marks ("???"), tag questions ("are you"), and upper case ("NOT") receive the highest feature weights (Table 4). In Figure 3, while the individual model LSTM_attn attends to the words almost equally, in the multitask variation several sarcasm markers such as "ya", "oops", and numerous exclamations ("!!") receive larger attention weights.
Addressing the second issue (i.e., when both the multitask and single-task models make wrong predictions), we notice that over 100 examples of the none (N) class were classified as argumentative by both BERT_MT-uncert and BERT_orig. For the N class, one of the most common types of wrong prediction occurs when the current turn ct sarcastically takes a "different stance" on a topic from pt in a narrow context, but the turn as a whole is not argumentative. In the following example, "does he just say the opposite of everything <name> says?" (pt) ↔ "using <name> as a 180 compass is just fine by me" (ct), the BERT_MT-uncert, BERT_orig, LSTM_MT-uncert, and LSTM_attn models all predict disagree (D) (since ct is sarcastic about "<name>") whereas the gold label is none (N). Looking closely at this pair of turns, it seems that ct presents a case of an ad hominem attack (on the person "<name>") rather than a none relation.
In the case of argumentative turns (agree and disagree) that are wrongly classified as none by all models, we find two common patterns: the use of concessions (e.g., "it's a consideration, but I doubt we should be promoting this . . . ") and arguments with uncommitted beliefs (e.g., "it is possible that", "that could probably be", "possibly, I must admit").

Conclusion and Future Work
Linguistic and argumentation theories have studied the use of sarcasm in argumentation, including its effectiveness as a persuasive device or as a means to express an ad hominem fallacy. We present a comprehensive experimental study of argumentative relation identification and classification using sarcasm detection as an additional task. First, in the discrete feature space, we show that sarcasm-related features, in addition to argument-related features, improve the accuracy of the argumentative relation identification/classification task by 3%. Next, we show that multitask learning using both a dual LSTM framework and BERT improves performance over the corresponding single-task models by a statistically significant margin. In both cases, the dynamic weighting of task-specific losses performs best. We provide a detailed qualitative analysis, manually investigating a large sample to show which characteristics of sarcasm are attended to and might have guided the correct predictions on the argumentative relation identification/classification task. In the future, we aim to study this synergy further by looking at sarcasm in connection with persuasive strategies (e.g., ethos, pathos, logos) and argument fallacies (e.g., the ad hominem attacks also noticed by Habernal et al. (2018)).

A.1 Parameter Tuning

Dual LSTM and multitask learning experiments: For the LSTM-based experiments, we searched over hyper-parameters on the dev set. In particular, we experimented with different mini-batch sizes (8, 16, 32), dropout values (0.3, 0.5, 0.7), numbers of epochs (40, 50), hidden states of different sizes (100, 300), and the Adam optimizer (learning rate 0.01). Embeddings were generated using FastText vectors (300 dimensions) (Joulin et al., 2016). Any token occurring fewer than five times was replaced by a special UNK token, where the UNK vector is created from random samples from a normal (Gaussian) distribution between 0.0 and 0.17.
After tuning, we use the following hyper-parameters on the test set: mini-batch size 16, hidden state size 300, 50 epochs, and a dropout value of 0.5. The weights of the task-specific losses for the dynamic multitask version were learned during training.

A.2 Results on the Sarcasm Detection Task
Although improving sarcasm detection is not the focus of our paper, we observe that multitask learning improves performance on this task as well, compared to the single-task models. We present results for the deep learning models in Table 5. The multitask models (both LSTM and BERT) outperform the corresponding single-task models (by 6.9 F1 and 6.4 F1 for the LSTM and BERT models, respectively). We note that the results on this particular dataset are much lower than on other datasets used for sarcasm detection. For example,