A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

In this paper, we introduce a new large-scale English Twitter dataset for the detection of cyberbullying and online abuse. Comprising 62,587 tweets, the dataset was sourced from Twitter using query terms designed to retrieve tweets with a high probability of containing various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation, with each tweet annotated by three annotators. All our annotators are high-school educated and frequent users of social media. Inter-rater agreement for the dataset, measured by Krippendorff's Alpha, is 0.67. Analysis of the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models, returning impressive results.


Introduction
Cyberbullying has been defined as an aggressive and intentional act repeatedly carried out using electronic means against a victim that cannot easily defend him or herself (Smith et al., 2008). Online abuse, by contrast, can refer to a wide range of behaviours that may be considered offensive by the parties at which they are directed (Sambaraju and McVittie, 2020). This includes trolling, cyberbullying, and sexual exploitation such as grooming, sexting and revenge porn (i.e., the sharing of inappropriate images of former romantic partners). A distinguishing feature of cyberbullying within the wider realm of online abuse is that it is a repeated act, and its prevalence on social media (along with other forms of online abuse) has generated significant interest in its automated detection. This has led to an increase in research efforts utilising supervised machine learning methods to achieve this automated detection. Training data plays a significant role in the detection of cyberbullying and online abuse: the domain bias, composition and taxonomy of a dataset can affect the suitability of models trained on it for abuse detection, and the choice of training data therefore strongly influences performance on these tasks.
While profanity and online aggression are often associated with online abuse, the subjective nature of cyberbullying means that accurate detection extends beyond the mere identification of swear words. Indeed, some of the most potent abuse witnessed online has been committed without profane or aggressive language. This complexity often requires labelling schemes that are more advanced than the binary annotation schemes used in many existing labelled datasets, and this influenced our approach in creating the dataset. After extracting data from Twitter using targeted queries, we created a taxonomy for various forms of online abuse and bullying (including subtle and indirect forms of bullying) and identified instances of these and other inappropriate content (e.g., pornography and spam) present within the tweets using a fine-grained annotation scheme. The result is a large labelled dataset with a majority composition of offensive content.
This paper is organised as follows. In Section 2, we present an overview of existing online abuse-related datasets. Section 3 discusses the collection method, composition, annotation process and usage implications for our dataset. Results of the experiments performed using the dataset are discussed in Section 4. Finally, conclusions and future research are described in Section 5.

Related Work
Social media has become the new playground and, much like physical recreation areas, bullies inhabit facets of this virtual world. The continually evolving nature of social media introduces a need for datasets to evolve in tandem to maintain relevance. The Barcelona Media dataset, used in studies such as those by Dadvar and Jong (2012), Nahar et al. (2014), Huang et al. (2014) and Nandhini and Sheeba (2015), was created over ten years ago and, while representative of social media usage at the time, social networks such as Myspace, Slashdot, Kongregate and Formspring from which some of the data was sourced are no longer widely used. The consequence is that such datasets are no longer representative of current social media usage. Twitter is one of the most widely used social media platforms globally; as such, it is no surprise that it is frequently used to source cyberbullying and online abuse data. Bretschneider et al. (2014) annotated 5,362 tweets, 220 of which were found to contain online harassment; the low proportion of offensive tweets present within the dataset (about 4%), however, limits its efficacy for classifier training. More recently, studies such as those by Rajadesingan et al. (2015), Waseem and Hovy (2016), Davidson et al. (2017), Chatzakou et al. (2017), Hee et al. (2018), Founta et al. (2018) and Ousidhoum et al. (2019) have produced datasets with higher proportions of positive cyberbullying and online abuse samples. Rajadesingan et al. (2015) labelled 91,040 tweets for sarcasm. This is noteworthy because while sarcasm can be used to perpetrate online bullying, it rarely features in the taxonomies of existing cyberbullying datasets. However, as the dataset was created for sarcasm detection only, this is the only context that can be learned from it; any model trained on this dataset will be unable to identify other forms of online abuse, limiting its usefulness.
Waseem and Hovy (2016) annotated 17,000 tweets using labels like racism and sexism, and Davidson et al. (2017) labelled over 25,000 tweets based on the presence of offensive and hate speech. Chatzakou et al. (2017) extracted features to identify cyberbullies by clustering 9,484 tweets attributed to 303 unique Twitter users. In creating their bi-lingual dataset sourced from ASKfm, Hee et al. (2018) used a detailed labelling scheme that acknowledges the different types of cyberbullying discovered in the retrieved post types. The dataset's effectiveness in training classifiers may, however, be affected by the low percentage of abusive documents present. This dataset was subsequently re-annotated by Rathnayake et al. (2020) to identify which of the four roles of 'harasser', 'victim', 'bystander defender' and 'bystander assistant' was played by the authors of the posts contained in the dataset. Similarly, Sprugnoli et al. (2018) used the same four roles, along with the labels created by Hee et al. (2018), to annotate a dataset created from simulated cyberbullying episodes using the instant messaging tool WhatsApp. Zampieri et al. (2019) used a hierarchical annotation scheme that, in addition to identifying offensive tweets, also identifies whether such tweets are targeted at specific individuals or groups and what type of target it is (i.e., an individual, e.g. '@username', or a group, e.g. '. . . all you republicans'). Hierarchical annotation schemes have indeed shown promise, as observed in their use in recent offensive language detection competitions like hatEval (competitions.codalab.org/competitions/19935) and OffensEval (sites.google.com/site/offensevalsharedtask); that said, a hierarchical scheme could inadvertently filter out relevant labels depending on the first-level annotation scheme used. Ousidhoum et al.
(2019) used one of the most comprehensive annotation schemes encountered in an existing dataset and additionally included a very high percentage of positive cyberbullying samples, but, regrettably, the number of English documents included in the dataset is small in comparison to other datasets. Founta et al. (2018) annotated about 10,000 tweets using labels like abusive, hateful, spam and normal, while Bruwaene et al. (2020) experimented with a multi-platform dataset comprising 14,900 English documents sourced from Instagram, Twitter, Facebook, Pinterest, Tumblr, YouTube and Gmail. Other notable publicly available datasets include the Kaggle Insult (Kaggle, 2012) and Kaggle Toxic Comments (Kaggle, 2018) datasets. A comprehensive review of publicly available datasets created to facilitate the detection of online abuse in different languages is presented in Vidgen and Derczynski (2020).

Data
In this section, we introduce our dataset and how it addresses some of the limitations of existing datasets used in cyberbullying and online abuse detection research.

Objective
In reviewing samples of offensive tweets from Twitter and existing datasets, we discovered that a single tweet could simultaneously contain elements of abuse, bullying, hate speech, spam and many other forms of content associated with cyberbullying. As such, attributing a single label to a tweet ignores other salient labels that can be ascribed to the tweet. We propose a multi-label annotation scheme that identifies the many elements of abusive and offensive content that may be present in a single tweet. As existing cyberbullying datasets often contain a small percentage of bullying samples, we want our dataset to contain a sizeable portion of bullying and offensive content and so devised querying strategies to achieve this. Twitter, being one of the largest online social networks with a user base in excess of 260 million (Statista, 2019) and highly representative of current social media usage, was used to source the data.

Table 1 (excerpt): Annotation labels with descriptions and examples.

Label: Bullying
Description: Tweets directed at a person(s) intended to provoke and cause offence. The target of the abuse must be identifiable from the tweet, either via mentions or names.
Example: @username You are actually disgusting in these slutty pictures Your parents are probably embarrassed. . .

Label: Insult
Description: Tweets containing insults typically directed at or referencing specific individual(s).
Example: @username It's because you're a c*nt isn't it? Go on you are aren't you?

Label: Profanity
Description: Assigned to any tweet containing profane words.
Example: @username please dont become that lowkey hating ass f**king friend please dont

Label: Sarcasm
Description: Sarcastic tweets aimed to ridicule. These tweets may be in the form of statements, observations and declarations.

Labels
Berger (2007) (as cited in Abeele and Cock 2013, p.95) distinguishes two types of cyberbullying, namely direct and indirect/relational cyberbullying. Direct cyberbullying is when the bully directly targets the victim (typified by sending explicit offensive and aggressive content to and about the victim) while indirect cyberbullying involves subtler forms of abuse such as social exclusion and the use of sarcasm to ridicule. As both forms of cyberbullying exist on Twitter, our annotation scheme (see Table 1) was designed to capture the presence of both forms of bullying within tweets.

Collection Methods
Offensive and cyberbullying samples are often minority classes within a cyberbullying dataset; as such, one of our key objectives was ensuring the inclusion of a significant portion of offensive and cyberbullying samples within the dataset to facilitate training without the need for oversampling. Rather than indiscriminately mining Twitter feeds, we executed a series of searches formulated to return tweets with a high probability of containing the various types of offensive content of interest. For insulting and profane tweets, we queried Twitter using the 20 most frequently used profane terms on Twitter as identified by Wang et al. (2014). These are: f*ck, sh*t, a*s, bi*ch, ni**a, hell, wh*re, d*ck, p*ss, pu**y, sl*t, p*ta, t*t, damn, f*g, c*nt, c*m, c*ck, bl*wj*b, retard. To retrieve tweets containing sarcasm, we used a strategy based on the work of Rajadesingan et al. (2015), who observed that sarcastic tweets often include the #sarcasm and #not hashtags to make it evident that sarcasm was intended. For our purposes, we found #sarcasm more relevant and therefore queried Twitter using this hashtag.
Formulating a search to retrieve tweets relating to social exclusion was challenging as typical examples were rare. From a seed sample of 5,000 tweets, we classified six as relating to social exclusion and from them identified the following hashtags for use as query terms: #alone, #idontlikeyou and #stayinyourlane. Due to the low number of tweets returned for these hashtags, we also extracted the replies associated with the returned tweets and discovered the additional hashtags #notinvited, #dontcometomyparty and #thereisareasonwhy, which were all subsequently used as query terms. Rather than excluding retweets when querying, as is common practice amongst researchers, our process initially extracted both original tweets and retweets and then selected only one of a tweet and its retweets if both were present in the results. This ensured relevant content was not discarded in situations where the original tweet was not included in the results returned but its retweets were. Our final dataset contained 62,587 tweets published in late 2019.
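The retweet-handling step described above can be sketched as follows. This is a minimal illustration of the selection logic, not the authors' actual implementation; `canonical_text` and `deduplicate` are hypothetical names, and the sketch assumes classic retweets carry the 'RT @username: ' prefix.

```python
def canonical_text(text):
    """Map a retweet to the text of its original tweet by stripping the
    'RT @username: ' prefix; original tweets pass through unchanged."""
    if text.startswith("RT @"):
        _, _, body = text.partition(": ")
        return body.strip()
    return text.strip()

def deduplicate(tweets):
    """Keep only one of a tweet and its retweets when both appear in the
    search results, preserving the first occurrence."""
    seen, kept = set(), []
    for tweet in tweets:
        key = canonical_text(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept
```

Because the retweet is kept when the original is absent from the results, relevant content is retained either way.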

Annotation Process
Language use on social media platforms like Twitter is often colloquial; the desired annotator profile was therefore that of an active social media user who understands the nuances of Twitter's colloquial language use. While there is no universal definition of what constitutes an active user on an online social network, Facebook defined an active user as someone who has logged into the site and completed an action such as liking, sharing or posting within the previous 30 days (Cohen, 2015). With one in every five minutes spent online involving social media and an average of 39 minutes spent daily on social media (Ofcom Research, 2019), this definition is inadequate in view of users' increased activity on social media. We therefore redefined an active user as one who has accessed any of the major social networks (e.g., Twitter, Instagram, Facebook, Snapchat) at least twice a week and made a post/comment, like/dislike or tweet/retweet at least once in the preceding two weeks. This definition is more in keeping with typical social media usage.
Using personal contacts, we recruited a pool of 17 annotators. Our annotators are from different ethnic/racial backgrounds (i.e., Caucasian, African, Asian, Arabian) and reside in different countries (i.e., US, UK, Canada, Australia, Saudi Arabia, India, Pakistan, Nigeria and Ghana). Additionally, their self-reported online social networking habits met our definition of an active social media user. All annotators were provided with preliminary information about cyberbullying including news articles and video reports, documentaries and YouTube videos as well as detailed information about the labelling task. Due to the offensive nature of the tweets and the need to protect young people from such content while maintaining an annotator profile close to the typical age of the senders and recipients of the tweets, our annotators were aged 18 -35 years.
Since the presence of many profane words can be detected automatically, a program was written to label the tweets for profane terms based on the 20 profane words used as query terms and the Google swear words list 3 . The profanity-labelled tweets were then provided to the annotators to alleviate this aspect of the labelling task. Each tweet was labelled by three annotators of different ethnic/racial backgrounds, genders and countries of residence. This was done to control for annotators' cultural and gender bias.
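A minimal sketch of such automatic profanity labelling, assuming a whole-word match against a term list; `PROFANE_TERMS` here is an illustrative subset (uncensored terms only), not the full combined list used for the dataset.

```python
import re

# Illustrative subset of the profane query terms; the dataset combined the
# Twitter query terms with the Google swear-words list.
PROFANE_TERMS = {"damn", "hell", "crap"}

def label_profanity(tweet, terms=PROFANE_TERMS):
    """Return True if the tweet contains any profane term as a whole word."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return any(token in terms for token in tokens)
```

Matching on whole tokens rather than substrings avoids false positives such as flagging 'damning' for 'damn'.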
An interesting observation from the annotation process was the influence of the annotators' culture on how labels were assigned. For example, we discovered that annotators from Asian, African and Arabian countries were less likely to assign the 'bullying', 'insult' and 'sarcasm' labels to tweets than annotators from the UK, Canada, US and Australia. A possible explanation is that the context of the abuse apparent to the latter group of annotators may not translate well to other cultures. While no other substantial trends were noticed for the other labels, this highlights the impact of an annotator's personal views and culture on the labelling task; the label composition of our dataset could have been different had we sourced annotators from different cultures. As identified by Bender and Friedman (2018), researchers should therefore be mindful of potential annotator biases when creating online abuse datasets.
Inter-rater agreement was measured via Krippendorff's Alpha (α), and majority agreement amongst annotators was required for each label. The krippendorff Python library (pypi.org/project/krippendorff) was used to compute the value, which was found to be 0.67; this can be interpreted as 'moderate agreement'. We believe that the culturally heterogeneous nature of our annotator pool could have 'diluted' the agreement amongst annotators and contributed to the final value achieved.
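The majority-agreement rule for aggregating the three annotators' labels can be sketched as follows (the alpha value itself was computed with the krippendorff library); `majority_labels` is a hypothetical helper, not the authors' code.

```python
from collections import Counter

def majority_labels(annotations):
    """Given each annotator's label set for one tweet, keep the labels
    assigned by a majority of annotators (at least 2 of 3 here)."""
    counts = Counter(label for annotator in annotations for label in annotator)
    quorum = len(annotations) // 2 + 1
    return {label for label, n in counts.items() if n >= quorum}
```

A label assigned by only one of the three annotators is thus discarded, while labels with two or three votes survive into the final multi-label assignment.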

Analysis
The number of tweets each label was assigned to is presented in Table 2, with 'Profanity' emerging as the dominant label and 'Exclusion' the least assigned. It can also be seen that about a sixth of the tweets were not assigned any labels. Before preprocessing, the maximum document length for the dataset was 167 characters, with an average document length of 91. Following preprocessing, the maximum document length reduced to 143 characters (equating to 26 words), with an average document length of 67 characters. The removal of mentions (i.e., a username included with the @ symbol inside a tweet), URLs and non-ASCII characters was found to be the biggest contributor to document length reduction. There are 37,453 unique tokens in the dataset. Figure 1 illustrates the number of tweets assigned to multiple labels. Single-label tweets make up more than a third of the dataset, which can mostly be attributed to the large number of tweets singly labelled as 'Profanity'. A significant number of tweets were also jointly labelled as 'Profanity' and 'Insult' or 'Insult' and 'Bullying', and this contributed to double-labelled tweets being the second-largest proportion of the dataset. Interestingly, there were more tweets with quadruple labels than with triple, which was discovered to be due to the high positive correlation between 'Porn'/'Spam' and 'Profanity'/'Insult'. The correlation matrix for the classes in the dataset is illustrated in Figure 2. The closer the correlation value is to 1, the higher the positive correlation between the two classes. The highest positive correlation is between 'Porn' and 'Spam' (0.91), followed by 'Insult' and 'Bullying' (0.41) and 'Insult' and 'Profanity' (0.25).
'Porn' and 'Spam' also demonstrated a positive correlation with 'Profanity', which can be attributed to the high proportion of profane terms in pornographic content and spam; we found that many pornographic tweets are essentially profanity-laden spam. 'Insult' also exhibited a positive correlation with 'Bullying' and 'Profanity', which can be attributed to the frequent use of profanity in insulting tweets as well as the use of insults to perpetrate bullying. The key negative correlations identified by the chart include those between 'Bullying' and both 'Porn' and 'Spam'. This can be attributed to bullying tweets often being personal attacks directed at specific individuals and typified by the use of usernames, personal names or personal pronouns, all of which are rare in pornographic and spam tweets. The minority classes 'Sarcasm', 'Threat' and 'Exclusion' exhibited minimal correlation with the other classes.
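Correlations like those in Figure 2 can be computed as the phi coefficient, i.e. the Pearson correlation of two binary label vectors (one entry per tweet, 1 if the label applies). A minimal sketch, assuming this is how the matrix was derived:

```python
import math

def phi(x, y):
    """Pearson correlation of two equal-length binary label vectors
    (the phi coefficient): +1 for identical labels, -1 for opposite."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)
```

Running `phi` over every pair of label columns yields the full correlation matrix; values near 1 indicate labels that frequently co-occur (e.g., 'Porn' and 'Spam'), values near -1 labels that rarely do.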

Bias Implication
Most datasets carry a risk of demographic bias (Hovy and Spruit, 2016) and this risk can be higher for datasets created using manually-defined query terms. Researchers, therefore, need to be aware of potential biases in datasets and address them where possible. Gender and ethnicity are common demographic biases that can be (often inadvertently) introduced into a dataset. To this end, we wanted to explore, as far as possible, whether our dataset had acquired gender bias. To do this we attempted to infer the gender of the users incorporated in our dataset. Since Twitter does not record users' gender information, we adopted an approach that uses the Gender API (https://gender-api.com) to deduce the gender of users based on whether the users' first names are traditionally male or female; we assumed this to be an accessible and feasible proxy for users' gender identity. We were able to process the authorship of 13,641 tweets (21.8% of the dataset) in this way and inferred that 31.4% of the authors of these tweets identified as female and 68.6% as male (at least in so far as was apparent from their Twitter account). This suggests a male bias in the authorship of the tweets in the dataset. We, however, recognise the limitations of this approach: the names provided by users cannot always be regarded as truthful, and as gender extends beyond the traditional binary types, a names-based approach such as this cannot deduce all gender identities. A more empathetic and effective means of identifying the gender of Twitter users would be an interesting facet of future work.
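The inference step can be sketched as follows; the Gender API call is replaced here by a hypothetical local lookup table, so `NAME_GENDER` and `gender_split` are purely illustrative stand-ins.

```python
# Hypothetical stand-in for the Gender API lookup: maps a first name to
# 'male', 'female', or None when no inference is possible.
NAME_GENDER = {"james": "male", "mary": "female", "alex": None}

def gender_split(first_names, lookup=NAME_GENDER):
    """Share of male/female authors among users whose first name could
    be resolved; unresolvable names are excluded from the denominator."""
    inferred = [g for g in (lookup.get(name.lower()) for name in first_names) if g]
    total = len(inferred)
    return {g: inferred.count(g) / total for g in ("female", "male")}
```

As in the dataset analysis, the proportions are reported only over the subset of users whose names could be resolved, which is why the figures above cover 21.8% of the dataset.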
With regard to racial and ethnic bias, we mitigated potential bias by including generalised variants of any ethnicity-specific keyword used as a query term, as well as variants for different ethnicities. It should, however, be noted that the popularity and topicality of certain keywords may still introduce an unintended bias; for example, #blacklivematters returns several more tweets than #asianlivematters.
While the collection strategy used to create our dataset ensured a high concentration of offensive tweets, a potential consequence of the imbalanced distribution of the classes is that it may reinforce the unintended bias of associating minority classes with specific hateful and offensive content. Dixon et al. (2018) defined unintended bias as a model performing better for comments containing specific terms than for others. For example, the phrase 'stay in your lane' was found in 4 of the 10 tweets identified as 'Exclusion' (due to the use of the hashtag #stayinyourlane as a query term); this can cause a model trained on the dataset to overgeneralise the phrase's association with the 'Exclusion' label, introducing a false positive bias in the model. Introducing more examples of the minority classes using a variety of query terms is a potential strategy for mitigating such unintended bias and is discussed further under future work.

Practical Use
Ultimately, the aim of a dataset such as this is to train machine learning models that can subsequently be used in abuse detection systems. It is, therefore, crucial to understand how any bias in the dataset manifests in the trained model and the impact of such bias in practical applications. A National Institute of Standards and Technology (NIST) study (Grother et al., 2019) discovered, for example, that many US-developed facial recognition algorithms generated significantly higher false positives for Asian and African-American faces compared to Caucasian faces, while similar algorithms developed in Asian countries did not show such dramatic differences in false positive rates between Asian, African-American and Caucasian faces. The study concluded that the use of diverse training data is critical to reducing bias in such AI-based applications.
Our dataset has been used to train the classifier in an online abuse prevention app (called BullStop), which is available to the public via the Google Play store. The app detects offensive messages sent to the user and automatically deletes them. It, however, acknowledges the possibility of both false positive and false negative predictions, and thus allows the user to review and re-classify deleted messages, using such corrections to retrain the system. This is especially important in a subjective field such as online abuse detection.

Setup
Models for comparison We experimented with both traditional classifiers (Multinomial Naive Bayes, Linear SVC, Logistic Regression) and deep learning-based models (BERT, RoBERTa, XLNet, DistilBERT) to perform multi-label classification on the dataset. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model designed to pre-train deep bi-directional representations from unlabelled text (Devlin et al., 2019). RoBERTa (Robustly Optimized BERT Pretraining Approach) is an optimised BERT-based model (Liu et al., 2019), and DistilBERT (Distilled BERT) is a compact BERT-based model (Sanh et al., 2019) that requires fewer computing resources and less training time than BERT (due to using about 40% fewer parameters) while preserving most of BERT's performance gains. XLNet (Yang et al., 2019) is an autoregressive language model designed to overcome some of the limitations of BERT. BERT, RoBERTa, XLNet and DistilBERT are available as pre-trained models but can also be fine-tuned by first performing language modelling on a target dataset.
Evaluation Each model's performance was evaluated using macro ROC-AUC (Area Under the ROC Curve), accuracy, Hamming loss, and macro and micro F1 scores, which are typically used in imbalanced classification tasks.
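For reference, Hamming loss and micro-averaged F1 for multi-label predictions can be computed as in the following pure-Python sketch, which mirrors the standard definitions (in practice, scikit-learn's `hamming_loss` and `f1_score` provide the same metrics).

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly,
    over all samples and all labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives
    across every label before computing a single F1 value."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)
```

Micro averaging is dominated by the frequent labels ('Profanity', 'Insult'), while macro averaging weights each label equally, which is why both are reported for this imbalanced dataset.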

Preprocessing
The primary objective of our preprocessing phase was the reduction of irrelevant and noisy data that may hamper classifier training. As is standard for many NLP tasks, punctuation, symbols and non-ASCII characters were removed, followed by the removal of mentions and URLs. We also discovered many made-up words created by combining multiple words (e.g., goaway, itdoesntwork, gokillyourself) in the tweets; these are due to hashtags, typos and attempts by users to work around the character limit imposed by Twitter. The wordsegment Python library was used to separate these into individual words; the library contains an extensive list of English words and is based on Google's 1T (1 trillion word) Web corpus. 6 Lastly, the text was converted to lower case.
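The cleaning steps above can be sketched as follows. The regular expressions are illustrative assumptions about the exact rules, and the word-segmentation step (done with the third-party wordsegment library, e.g. `segment('goaway')` yielding `['go', 'away']`) is omitted to keep the sketch self-contained.

```python
import re

def preprocess(tweet):
    """Apply the cleaning steps described above: drop mentions, URLs,
    non-ASCII characters and punctuation, collapse whitespace, lower-case."""
    tweet = re.sub(r"@\w+", " ", tweet)                 # mentions
    tweet = re.sub(r"https?://\S+", " ", tweet)         # URLs
    tweet = tweet.encode("ascii", "ignore").decode()    # non-ASCII characters
    tweet = re.sub(r"[^a-zA-Z0-9\s]", " ", tweet)       # punctuation and symbols
    return re.sub(r"\s+", " ", tweet).strip().lower()
```

The ordering matters: mentions and URLs are removed before the punctuation pass, since stripping '@', ':' and '/' first would leave their fragments behind as ordinary tokens.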

Results
We provide the stratified 10-fold cross-validation results of the experiments in Table 3. The best macro ROC-AUC score was achieved by the pre-trained RoBERTa model, while the best macro and micro F1 scores were attained using the pre-trained BERT and RoBERTa models, respectively. The best overall accuracy was returned by the fine-tuned DistilBERT model. As expected, the deep learning models outperformed the baseline classifiers, with Multinomial Naive Bayes providing the worst results across the experiments and the BERT-like models achieving the best results for each metric. Interestingly, the pre-trained models were marginally better than the equivalent fine-tuned models, implying that fine-tuning the models on the dataset degrades rather than improves performance.
As would be expected, the models performed better at predicting labels with higher distributions. For minority classes like 'Sarcasm', 'Threat' and 'Exclusion', RoBERTa and XLNet performed better. All the models performed well in predicting the 'none' class, i.e., tweets with no applicable labels.
The dataset resulting from our collection methods is imbalanced, with a high percentage of cyberbullying tweets. In reality, such a concentration of cyberbullying and offensive tweets is highly unusual and at odds with other cyberbullying datasets. To evaluate the generalisability of models trained on our dataset, we performed further experiments to evaluate how the models perform on other, unseen datasets. We used our best-performing model, the pre-trained RoBERTa, to perform predictions on samples extracted from two other datasets and compared the results against those achieved on our dataset by RoBERTa models trained on the other datasets.
The dataset created by Davidson et al. (2017) and the Kaggle Toxic Comments dataset (Kaggle, 2018) were selected for the experiments; the results are presented in Table 4.
Overall, models trained on our dataset (RoBERTa C→D and RoBERTa C→K) perform better on the other two datasets than the models trained on the other datasets and tested on the Cyberbullying dataset (RoBERTa D→C and RoBERTa K→C). Interestingly, models trained on our dataset achieved better ROC-AUC and macro and micro F1 values on both the Davidson (D) and the Kaggle (K) datasets than the in-domain results on those datasets (i.e., models trained and evaluated on the same dataset: RoBERTa D→D and RoBERTa K→K). The results indicate that our dataset captures enough context for classifiers to distinguish between cyberbullying and non-cyberbullying text across different social media platforms.

Discussion and Future Work
Our collection strategy for creating the dataset was designed to target cyberbullying and offensive tweets and ensure that these types of tweets constitute the majority class. This differs from the collection strategies used in other datasets such as those by Dadvar et al. (2013), Kontostathis et al. (2013) and Hosseinmardi et al. (2015) which are designed to simulate a more realistic distribution of cyberbullying. As the occurrence of cyberbullying documents is naturally low, classifiers trained on our dataset can benefit from a high concentration of cyberbullying and offensive documents without the need for oversampling techniques.
When cross-domain evaluation was performed using our best-performing classifier on two other datasets (Davidson et al., 2017; Kaggle, 2018), the model trained on our dataset performed better than those trained on the other datasets. It is also worth noting that the composition and annotation of these other datasets are entirely different from ours, and one was sourced from a different platform (Wikipedia). Our results demonstrate that deep learning models can learn sufficiently from an imbalanced dataset and generalise well to different data types.
We discovered a slight performance degradation for the deep learning-based models after fine-tuning. As recently shown by Radiya-Dixit and Wang (2020), fine-tuned networks do not deviate substantially from pre-trained ones, and large pre-trained language models have high generalisation performance. In future work, we will explore more effective ways of producing fine-tuned networks, such as learning to sparsify pre-trained parameters and optimising the most sensitive task-specific layers.
The distribution of the 'Sarcasm', 'Exclusion' and 'Threat' labels is low within the dataset. Consequently, the models' ability to predict these classes is not comparable to that for the majority classes. Increasing the distribution of these labels within the dataset will improve model training and mitigate unintended bias that may have been introduced by the minority classes; we therefore plan to supplement the dataset with more positive samples of these classes by exploring other querying strategies as well as incorporating samples from existing datasets such as those by Rajadesingan et al. (2015) and Hee et al. (2018).

Conclusion
In this paper, we presented a new cyberbullying dataset and demonstrated the use of transformer-based deep learning models to perform fine-grained detection of online abuse and cyberbullying with very encouraging results. To our knowledge, this is the first attempt to create a cyberbullying dataset with such a high concentration (82%) of cyberbullying and offensive content in this manner and to successfully evaluate a model trained with the dataset on a different domain. The dataset is available at https://bitbucket.org/ssalawu/cyberbullyingtwitter for the use of other researchers.