“Are you kidding me?”: Detecting Unpalatable Questions on Reddit

Abusive language in online discourse negatively affects a large number of social media users. Many computational methods have been proposed to address this issue of online abuse. Existing work, however, tends to focus on detecting the more explicit forms of abuse, leaving the subtler forms largely untouched. Our work addresses this gap by making three core contributions. First, inspired by the theory of impoliteness, we propose a novel task of detecting a subtler form of abuse, namely unpalatable questions. Second, we publish a context-aware dataset for the task using data from a diverse set of Reddit communities. Third, we implement a wide array of learning models and also investigate the benefits of incorporating conversational context into computational models. Our results show that modeling subtle abuse is feasible but difficult due to the language involved being highly nuanced and context-sensitive. We hope that future research in the field will address such subtle forms of abuse since their harm currently passes unnoticed through existing detection systems.


Introduction
Abusive language and other antisocial behaviour are omnipresent in online discourse. According to a recent survey, 41% of Americans have personally experienced some form of online harassment (Duggan, 2017). To counter abusive behaviour online, different social media platforms implement their own mechanisms such as content moderation, muting, or blocking users from posting. It is, however, infeasible to manually moderate online communities due to the sheer enormity of content produced every day; Twitter, for example, receives over 500 million tweets per day. Manual moderation in such a scenario would require humans to read millions of tweets daily, which would take an impractical amount of time and other resources. Consequently, many computational models have been proposed by the Natural Language Processing (NLP) community to detect online abuse and facilitate automatic content moderation.
Abuse is an umbrella term which can cover several types of negative expressions. There exists a plethora of abuse detection studies employing different terminology: personal attacks (Wulczyn et al., 2017), bullying (Dadvar et al., 2013; Chatzakou et al., 2017), hate speech (Warner and Hirschberg, 2012; Davidson et al., 2017; Djuric et al., 2015; Gao and Huang, 2017), nastiness (Samghabadi et al., 2017), harassment (Golbeck et al., 2017; Yin et al., 2009), hostility (Liu et al., 2018), racism or sexism (Waseem and Hovy, 2016), abusive language (Nobata et al., 2016), aggression (Caines et al., 2018), and others. However, extant work in abuse detection has largely focused on detecting overt abuse, ignoring the more subtle forms of abuse which can be just as damaging. This is also noted in a recent survey calling on the NLP community to rethink and expand what constitutes abuse (Jurgens et al., 2019).
In this work, we make three contributions to address this gap in the literature. First, inspired by the theory of linguistic impoliteness, we propose a novel task of detecting a subtler form of abuse called unpalatable questions (UQ). It is one of the conventionalised impoliteness formulae introduced by Culpeper (2010). We define the UQ task as detecting a negatively phrased question designed to antagonise its recipient in online discourse.
Second, we collect, annotate, and make publicly available a context-aware dataset for the UQ task. 1 The data comes from a diverse set of online communities (or subreddits) on the popular social media site Reddit. Most existing datasets used in abuse detection (Wulczyn et al., 2017; Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018; Golbeck et al., 2017) only include annotations for stand-alone comments or tweets. In comparison, we explicitly consider conversational context during annotation and preserve contextual information in our dataset (see Section 4.4 for a detailed comparison).
A major limitation of existing abuse detection studies, also pointed out by Castelle (2018), Mishra et al. (2019), and Gao and Huang (2017), is that a comment is treated as a single utterance in isolation, ignoring any conversational context provided by other comments in the discussion. This is problematic since abuse is inherently contextual, and it becomes a major issue when working with subtler forms of abuse such as unpalatable questions. To this end, our third contribution is that we implement a wide array of learning models to detect unpalatable questions and investigate the benefits of incorporating conversational context into our computational models.

What is an Unpalatable Question?
We adopt the term unpalatable question (UQ) from the conventionalised impoliteness formulae introduced by Culpeper (2010). Although Culpeper did not formally define UQ, several examples were laid out: 'why do you make my life impossible?', 'which lie are you telling me?', 'what's gone wrong now?'. We find that UQs tend to be rhetorical in nature in that they are usually asked not to elicit an answer but to make a point. In particular, they have a close resemblance to epiplexis: a type of rhetorical question which is asked not to elicit information but to reproach, upbraid, or rebuke (Zimmerman, 2005). This can be seen in the examples listed in Table 1, where the questions are asked to shame the interlocutor for adopting a particular point of view and are often insults asked as questions. For our task, we define an unpalatable question as a negatively phrased question designed to antagonise its recipient.
Why UQ? Jurgens et al. (2019) outline a spectrum of abusive behaviour highlighting that existing work only focuses on overt abuse ignoring both the subtler forms and extreme behaviours. As can be seen in Figure 1, our task of detecting a subtler form of abuse is a step towards addressing this gap. Moreover, studies in linguistics show that being asked an unpalatable question puts the recipient in a vulnerable position to receive further verbal attacks (Bousfield, 2007;Wijayanto et al., 2017).

Abusive Language Detection
Work on abuse detection has studied specific types of abuse using several feature-based and deep learning approaches. One of the earliest studies was by Yin et al. (2009), who employed an SVM to detect 'personal insult harassment' using TF-IDF values for words and sentiment-based features. Using a similar but enhanced set of features, Davidson et al. (2017) implement Logistic Regression and SVM to detect hate speech and offensive language on Twitter. Warner and Hirschberg (2012) use a template-based strategy to extract features from text and a linear SVM to detect hate speech with a focus on anti-semitic language. Djuric et al. (2015) report an AUC of 80% using Logistic Regression with paragraph2vec, which outperformed standard bag-of-words approaches. Nobata et al. (2016) also use word2vec and comment2vec as one of their features to detect 'abusive language' which, in their work, encompasses hate speech, profanity, and derogatory language. Wulczyn et al. (2017) implement a multilayer perceptron with word and character n-grams to detect personal attacks on Wikipedia and report an AUC of 96.5%.
In recent years, deep learning techniques have been widely adopted to detect online abuse. Pavlopoulos et al. (2017) show that an RNN with GRU cells outperforms the original classifier of Wulczyn et al. (2017) on detecting personal attacks. Park and Fung (2017) combine word-level and character-level CNNs, comparing one-step and two-step classification setups for detecting abusive language on Twitter.

Linguistic Impoliteness
Long before detecting online abuse gained attention, there had been significant research on linguistic impoliteness. The most notable contribution in this field is by Culpeper (1996), who introduced his theory of impoliteness as a parallel to Brown and Levinson (1987)'s politeness theory. Impoliteness is defined as the use of strategies to attack the interlocutor's face, the persona that one presents in a conversation (Goffman, 1967), and create social disruption (Culpeper, 1996).

Rhetorical Questions
Rhetorical questions are defined as sentences "that have the form of a question but serve as a statement" (Anzilotti, 1982). Since unpalatable questions tend to be rhetorical in nature, we present a brief overview of the literature on rhetorical question detection in social media.
One of the first studies was on Twitter data by Li et al. (2011), where they distinguish 'qweets', tweets that ask for some information, from other interrogative tweets including rhetorical questions. They implement SVM using a set of different handcrafted features. Bhattasali et al. (2015) use bag of n-grams features to detect rhetorical questions in the Switchboard Dialogue Corpus. Their best-performing model achieved an F1-score of 0.53 by incorporating both preceding and subsequent text. Using questions from Twitter and Debate Forums, Oraby et al. (2017) implement SVM and LSTM models to detect rhetorical questions, and further distinguish between sarcastic rhetorical questions and other questions. There exist other studies modeling rhetorical questions that draw inspiration from linguistic theories behind the motivations of users to post rhetorical questions (Ranganath et al., 2016, 2018). A general consensus in these studies is that rhetorical questions are hard to accurately classify due to their syntactic similarity to regular questions.

Data
The aim is to detect unpalatable questions in online discourse. For this, we construct a dataset using comments from Reddit and annotate them for whether they contain an unpalatable question or not. We also preserve conversational context in the dataset by including the preceding comment in the discussion; therefore, our data consists of (pc_i, r_i, y_i) tuples denoting the preceding comment, reply 3 (or main comment), and the corresponding label, respectively. The task is formulated as a binary classification problem where y_i = 1 indicates that the main comment r_i contains an unpalatable question.
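For concreteness, a dataset record can be pictured as follows. This is an illustrative sketch only; the field names and example values are ours, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class UQSample:
    """One (preceding comment, reply, label) tuple.

    Field names are illustrative; the released data may use different keys.
    """
    preceding_comment: str  # pc_i: the comment being replied to
    reply: str              # r_i: the main comment annotated for a UQ
    label: int              # y_i: 1 if the reply contains an unpalatable question

sample = UQSample(
    preceding_comment="I think the referee made the right call.",
    reply="Are you kidding me? Did you even watch the game?",
    label=1,
)
```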
We collect data from a diverse set of 15 online communities (or subreddits) belonging to different genres: politics, sports, hate and toxic. 4 The subreddits were carefully selected to prevent the dataset from being heavily skewed towards not unpalatable samples since these topics are more likely to involve opinionated and antagonistic discussions.

Question Filter
A challenge during data collection was to filter out comments that did not contain a question. We experiment with two approaches: (1) a simple rule-based approach where we tokenize the comment and extract sentences that end with a '?', and (2) a parsing-based approach where we first generate constituent parse trees using Stanford CoreNLP (Manning et al., 2014) and then identify questions using clause-level Penn Treebank Tags. 5
Performance Comparison. We manually annotated a random sample of 300 Reddit comments for the presence of questions. There were a total of 81 questions, out of which 74 contained a '?'. Although the parsing-based approach achieved a high precision, it missed several questions due to the low accuracy of the parser on noisy social media text. On the other hand, the simpler rule-based approach achieved a much higher recall and only missed the 7 samples that did not contain a '?'. Given the high disparity in performance, we decided to use the rule-based approach as our question filter. Although a potential data limitation, it is an acceptable design decision given that 91% of questions in our random sample were explicitly phrased using a '?'. This simple '?' heuristic has also been successfully used for identifying questions in other social media studies (Zhao and Mei, 2013).
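A minimal sketch of the rule-based filter described above, assuming NLTK's sentence tokenizer as the splitter (the tokenizer choice is an assumption, not a detail from the original implementation):

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def extract_questions(comment: str) -> list[str]:
    """Return the sentences in a comment that end with a question mark."""
    sentences = nltk.sent_tokenize(comment)
    return [s.strip() for s in sentences if s.strip().endswith("?")]

def contains_question(comment: str) -> bool:
    return len(extract_questions(comment)) > 0

# Example
print(extract_questions("Nice try. Are you kidding me? Read the article."))
# ['Are you kidding me?']
```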

Crowdsourcing
We use Amazon Mechanical Turk for crowdsourcing our data annotations. The coders were shown the main comment and also the preceding comment for context. They were asked to label the main comment for whether it contained an unpalatable question or not. Each comment in our dataset is labeled by at least five different coders.
Quality Control. Since coders can sometimes be unreliable at labeling abusive content (Nobata et al., 2016), we employ three measures to ensure high-quality annotations. First, we provided high-quality training to our coders through clear instructions that laid out detailed tips, examples, and counterexamples. 6 Second, we allowed only qualified coders to contribute to the task: they were required to achieve a perfect score on a quiz consisting of 10 questions. Third, we inserted secret test questions throughout our task to address the issue of spam responses (Kittur et al., 2008). Coders were disqualified and blocked if their accuracy on the test questions fell below our predefined threshold of 90%.

Data Description
We aggregated the five annotations by taking the majority as the final label: a data sample is considered unpalatable if at least three coders labeled it as unpalatable. In order to not lose useful information, we added a confidence dimension to the dataset, defined as the ratio of the number of annotations with the majority label to the total number of annotations: confidence ∈ {0.6, 0.8, 1.0}. As can be seen in Table 2, 1,917 (17.5%) comments contain an unpalatable question, and the remaining 82.5% of comments do not. It is interesting to note the distribution of confidence scores across the two labels. Annotators seem to be much more confident for not unpalatable samples: 58% of such samples have a confidence score of 1.0 as compared to 25% for unpalatable samples. Following a similar trend, 45% of comments labeled unpalatable have a confidence score of 0.6 as compared to only 14% for not unpalatable samples. This highlights the nuanced nature and complexity associated with identifying unpalatable questions.
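As an illustration, the majority label and confidence score for a single comment with five annotations can be derived as follows (a sketch; the released dataset already includes these fields):

```python
from collections import Counter

def aggregate(annotations: list[int]) -> tuple[int, float]:
    """Aggregate binary annotations into a majority label and a confidence score.

    With five annotators, confidence takes values in {0.6, 0.8, 1.0}.
    """
    counts = Counter(annotations)
    label, majority_count = counts.most_common(1)[0]
    confidence = majority_count / len(annotations)
    return label, confidence

print(aggregate([1, 1, 0, 1, 0]))  # (1, 0.6): unpalatable, low agreement
print(aggregate([0, 0, 0, 0, 0]))  # (0, 1.0): not unpalatable, full agreement
```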
Annotator Agreement. We compute two measures of inter-annotator reliability: (1) Cohen's Kappa, and (2) Krippendorff's alpha. Our data achieved a Kappa score of 0.82 against a random sample of 150 comments manually annotated by the authors. Out of these 150 comments, there were a total of 8 instances of disagreement, 7 of which had a confidence score of 0.6. Next, we compute Krippendorff's α, which is used when multiple coders annotate overlapping but different sets of comments (Krippendorff, 2004). Our data achieved α = 0.39, which is in line with other abuse detection work that used crowdsourcing (Wulczyn et al., 2017; Cheng et al., 2015).
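Both reliability measures can be computed with standard libraries; a sketch assuming the scikit-learn and krippendorff Python packages, with placeholder label arrays standing in for the real annotations:

```python
import numpy as np
import krippendorff                      # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Cohen's Kappa between the aggregated crowd labels and the authors' labels
# on the manually re-annotated sample of 150 comments (placeholder values).
crowd_labels = np.random.randint(0, 2, size=150)
author_labels = np.random.randint(0, 2, size=150)
kappa = cohen_kappa_score(crowd_labels, author_labels)

# Krippendorff's alpha over the full annotation matrix: one row per coder,
# one column per comment, np.nan where a coder did not label that comment.
reliability_data = np.array([
    [1, 0, np.nan, 1],
    [1, 0, 0, np.nan],
    [np.nan, 0, 0, 1],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"kappa={kappa:.2f}, alpha={alpha:.2f}")
```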

Comparison with Existing Datasets
As previously discussed, most datasets used for abuse detection contain annotations only for stand-alone comments in isolation. This is problematic since offensiveness can depend highly on the context. Castelle (2018) shows how their learning models failed (F1 = 0.3) on a StackOverflow dataset that required contextual enrichment to determine the offensiveness of a comment: a majority of comments that were originally flagged as offensive were not considered offensive by their coders. This is because the dataset lacked the interactional context which was available to StackOverflow users when they originally flagged the comments as offensive.
We are aware of the following existing datasets that include contextual information: 7
• Karan and Šnajder (2019) published a large dataset of 400k comments from Wikipedia including complete discussion threads. However, a major limitation of their data is that the labels are generated automatically using an existing toxicity classifier. 8 This implies that their labels would not be accurate for comments where the original toxicity classifier itself fails. In comparison, we perform manual annotation where our coders explicitly consider interactional context.
• Liu et al. (2018) published a dataset of 30,987 comments from Instagram annotated for hostility. The coders were shown an Instagram post and all comments in the thread. However, their data collection is (intentionally) biased towards teenagers and is filtered by certain keywords, e.g., profanities and emojis. In comparison, our dataset involves a random sample of comments from a diverse set of subreddits without the use of any keyword filtering.
• Gao and Huang (2017) released a dataset of 10 complete discussion threads from Fox News. The data includes additional contextual information in the form of user screen names, other comments in the thread, and the title of the news article. However, their dataset is much smaller with only 1,528 comments and includes only two annotations per comment. In comparison, we use at least five annotations for each of the 10,909 comments in our dataset.

Methodology
In this section, we introduce our methodology for detecting unpalatable questions.

Traditional Machine Learning
We implement Logistic Regression 9 using a diverse set of features:
• N-grams: We use TF-IDF values for word unigrams, bigrams, and trigrams. We also utilise character trigrams, 4-grams, and 5-grams.
• Surface features: These include counts of capitalised words, question marks, exclamation marks, and second-person pronouns.
• Lexicon-based: This category includes three features computed using pre-defined lexicons:
 - We compute the number of non-English words by comparing against NLTK words (Loper and Bird, 2002) and Enchant's dictionary (https://github.com/rfk/pyenchant).
 - We compute the number of toxic words using a lexicon compiled from different sources, including a publicly released list of bad words (https://code.google.com/archive/p/badwordslist/downloads).
• Sentiment: We use the positive, negative, and neutral sentiment scores returned by VADER, a rule-based sentiment analyser built for social media text (Hutto and Gilbert, 2014).
To build feature vectors, we experiment with the features in isolation as well as several combinations of these feature categories. The feature vectors along with the corresponding labels y i are then fed to the learning algorithm, which is implemented using scikit-learn (Pedregosa et al., 2011).
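A minimal sketch of the n-gram portion of this setup, built with scikit-learn. The lexicon and sentiment features described above would be appended as additional columns; the class weighting shown here is an assumption on our part, not a reported setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ("features", FeatureUnion([
        # word unigrams, bigrams, and trigrams
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
        # character trigrams, 4-grams, and 5-grams
        ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# comments: list[str]; labels: list[int] with 1 = contains an unpalatable question
# pipeline.fit(comments, labels)
# preds = pipeline.predict(new_comments)
```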

Deep Learning
Deep learning models have been successfully used in many abuse detection studies (Pavlopoulos et al., 2017; Park and Fung, 2017; Aken et al., 2018). In this work, we implement a number of deep learning models, both CNN and RNN-based, using the architecture shown in Figure 2. We use pre-trained GloVe embeddings for the embedding layer. 13 An encoder is responsible for condensing a sequence of word vectors to a single vector. We experiment with a number of neural networks for the encoder: CNN, LSTM, Bidirectional LSTM, and Stacked Bidirectional LSTM.
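A plain-PyTorch sketch of one such configuration: a bidirectional LSTM encoder over pre-trained embeddings feeding a classifier. This is an illustration only, not the authors' AllenNLP configuration; dimensions and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embeddings: torch.Tensor, hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        # pre-trained (e.g. GloVe) vectors; fine-tuned here, could also be frozen
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)
        self.encoder = nn.LSTM(embeddings.size(1), hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq, emb)
        _, (hidden, _) = self.encoder(embedded)   # hidden: (2, batch, hidden)
        sentence_vec = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(sentence_vec)      # logits over the two classes
```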
ELMo. For the embedding layer, we also experiment with deep contextualized ELMo (Peters et al., 2018) representations 14 as an alternative to using GloVe embeddings. The encoder layer here can be either a CNN, LSTM, Bi-LSTM, or Stacked Bi-LSTM.
Dense Hybrid. We also implement deep learning models that utilise the various hand-engineered features discussed in Section 5.1. For this, we compute a 'dense' feature vector using those feature categories, and concatenate it with the neural encoder's output. This combined vector is then fed to a fully-connected feedforward neural network followed by a softmax layer.
Implementation details. All models are implemented using AllenNLP (Gardner et al., 2017), an open-source deep learning library for NLP. The training objective is weighted cross-entropy loss, and the Adam optimizer (Kingma and Ba, 2014) is used for learning network weights. Additionally, early stopping is implemented to terminate training of the neural network once the loss stops improving on a set-aside validation set.
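A minimal sketch of the dense hybrid combination and the weighted objective, again in plain PyTorch rather than AllenNLP; the hidden size and class weights below are placeholders, not reported values.

```python
import torch
import torch.nn as nn

class DenseHybridHead(nn.Module):
    """Concatenate the encoder output with a hand-engineered feature vector."""
    def __init__(self, encoder_dim: int, dense_dim: int, num_classes: int = 2):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(encoder_dim + dense_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, encoded: torch.Tensor, dense_features: torch.Tensor) -> torch.Tensor:
        return self.ff(torch.cat([encoded, dense_features], dim=-1))

# Weighted cross entropy to counter the class imbalance (weights are placeholders).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 4.7]))
```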

Incorporating Conversational Context
Since humans can better comprehend a comment with reference to its context, we wanted to investigate the benefits of incorporating conversational context in the learning models. For traditional machine learning models, we concatenate the feature vectors for the preceding comment pc_i and the main comment r_i, which is then fed to the learning algorithm. For deep learning models, we first vectorize the preceding comment pc_i and the main comment r_i using the same encoder pipeline. The two vectors are then concatenated and fed to a feedforward neural network (Figure 3). In the LSTM-based models, the final hidden states of the pc_i and r_i pipelines are concatenated. For CNN, the outputs of the max pooling layers of the pc_i and r_i pipelines are concatenated. In addition to simple concatenation, we experiment with additional heuristics to model context inspired by the task of Natural Language Inference (NLI). Specifically, we use a CNN encoder to vectorize the context pc_i and the main comment r_i. They are then combined using three heuristics: (1) concatenation, (2) element-wise product, and (3) element-wise difference (Mou et al., 2016).
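The NLI-style heuristics can be illustrated as below, assuming a shared encoder has already produced a context vector u (for pc_i) and a reply vector v (for r_i). The sketch concatenates all three combinations into a single representation, as in Mou et al. (2016); dimensions are illustrative.

```python
import torch
import torch.nn as nn

def combine_context(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Combine context vector u and reply vector v via concatenation,
    element-wise product, and element-wise difference."""
    return torch.cat([u, v, u * v, u - v], dim=-1)

encoder_dim = 256
classifier = nn.Sequential(           # classifier over the combined representation
    nn.Linear(4 * encoder_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)

u = torch.randn(8, encoder_dim)       # encoded preceding comments (batch of 8)
v = torch.randn(8, encoder_dim)       # encoded main comments
logits = classifier(combine_context(u, v))
```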

Experiments and Results
We conduct experiments across two dimensions:
1. Confidence Score: We hypothesize that the models would exhibit better performance on data samples which were easier for the coders to annotate. To test this, we experiment with two scenarios:
• All-Data: we use the complete dataset.
• High-Agreement-Data: we use data samples corresponding to confidence = 1.0.

2. Text Input: To investigate the benefits of including contextual information, we experiment with three input scenarios:
• Question Text Only: we only provide the question text as input.
• Reply Text: we provide the full text of the main comment as input.
• Reply Text + Comment Text: in addition to the main comment, we also provide the preceding comment text as input.
We evaluate our computational models on several classification metrics: precision, recall, F1-score, and the area under the Receiver Operating Characteristic curve (AUROC) and the Precision-Recall curve (AUPRC). All reported values are averaged over stratified five-fold cross-validation runs. The empirical results on All-Data and High-Agreement-Data are presented in Table 3 and Table 4 respectively. 15
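For reference, a sketch of this evaluation protocol using scikit-learn's cross-validation utilities, reusing the feature-based pipeline from the earlier sketch; note that scikit-learn's average_precision scorer is used here as a stand-in for AUPRC.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

scoring = {
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "auroc": "roc_auc",
    "auprc": "average_precision",   # area under the precision-recall curve
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# `pipeline`, `comments`, and `labels` as defined in the earlier feature-based sketch
scores = cross_validate(pipeline, comments, np.array(labels), cv=cv, scoring=scoring)
for name in scoring:
    print(f"{name}: {scores['test_' + name].mean():.3f}")
```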

Discussion
Among traditional learning algorithms, a combination of simple word unigrams, bigrams, and trigrams achieves the best F1-score of 0.44. Adding other hand-engineered features to word(1, 3) results in better precision and AUPRC. As expected, deep learning models outperform traditional machine learning algorithms in both the All-Data and High-Agreement-Data scenarios. In particular, CNN models perform much better than LSTM models. This is evident from Tables 3 and 4, where the best CNN model outperforms the best LSTM model by a 3-point and 5-point increase in F1-score respectively. We observe improvements when using contextualized ELMo embeddings as opposed to static GloVe embeddings in both scenarios. Moreover, the addition of the dense hand-engineered feature vector further improves the F1-score to 0.532. Finally, our hypothesis that samples which were easier for humans to annotate would also be easier for the models to classify holds true: there is a considerable improvement in F1-score (15 points) for High-Agreement-Data.
Despite the performance gains observed with more sophisticated deep learning models, the performance is still too poor for practical applications. This is not surprising given how linguistically nuanced our dataset is and the complexity associated with abusive language detection on noisy social media text (Nobata et al., 2016). Specifically, learning models struggle to deal with implicit abuse, i.e., language which does not immediately convey abuse (Waseem et al., 2017). Aken et al. (2018) find that their toxicity classifier fails on data containing instances of sarcasm, toxicity without swear words, and rhetorical questions. We qualitatively examined a random sample of one hundred unpalatable samples from our dataset and found that 65% do not contain swear words and 20% involve sarcasm. Similarly, in a random sample of one hundred misclassified unpalatable samples, 72% do not contain swear words and 30% involve sarcasm. Moreover, since unpalatable questions are rhetorical in nature, it is not surprising that learning models performed relatively poorly on the task.
Context. Our assumption was that models would benefit from conversational context since humans find it easier to determine the offensiveness of a comment when provided with some context. It is, however, evident from our empirical results that incorporating context by providing the preceding comment to the model did not improve performance for either traditional machine learning or deep learning models. This finding is consistent with other studies that attempt to incorporate interactional context into their models (Karan and Šnajder, 2019; Lee et al., 2018). We believe that effectively incorporating deeper context, as opposed to just the preceding comment, using more sophisticated methods such as hierarchical neural networks might help improve performance.
Evaluation Metrics. Mishra et al. (2019) observe a problematic trend with several abuse detection studies using AUROC for evaluation. This is not ideal since ROC plots can be deceptive when dealing with imbalanced classification scenarios (Saito and Rehmsmeier, 2015). Since most abuse detection datasets tend to be heavily skewed towards non-abusive samples, this can lead to misleadingly optimistic values for AUROC (also observed in Tables 3 and 4). A better alternative is to report AUPRC which is more robust to imbalanced data since it evaluates the fraction of true positives among positive predictions at different thresholds (Saito and Rehmsmeier, 2015).

Conclusion
In this work, we addressed an important gap in the abuse detection literature by introducing a novel task of detecting unpalatable questions. We also released a context-rich dataset for the task and implemented a number of learning models to automatically detect unpalatable questions. Our results show that it is difficult to model subtle abuse due to the language being nuanced and context-sensitive. This calls for advancements in natural language understanding methods that can identify such implicit signals and take pragmatic context into account. We hope that future research will explore other forms of abuse and draw inspiration from related fields such as linguistic impoliteness. Detecting abuse, both overt and subtle, on the Internet would help enhance users' experience online and facilitate civil and productive discussions.

A Data Collection
We collect data from a diverse set of 15 Reddit communities (or subreddits) belonging to different genres:
• Politics: r/The_Donald, r/politics, r/PoliticalDiscussion, r/Conservative
• Sports: r/nfl, r/sports, r/nba, r/hockey
• Hate and Toxic: r/cringepics, r/cringe, r/4chan, r/CringeAnarchy, r/KotakuInAction, r/ImGoingToHellForThis, r/TumblrInAction

B Crowdsourcing Instructions
The instructions provided to Amazon Mechanical Turk coders are shown in Figure 4 and Figure 5.

C Results
The complete list of results for the All-Data scenario are shown in Table 5 (deep learning models) and Table 7 (traditional machine learning models). Next, the complete list of results for the High-Agreement-Data scenario are shown in Table 6 (deep learning models) and Table 8 (traditional machine learning models).