SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.


Introduction
Offensive language in social media has become a concern for governments, online communities, and social media platforms. Free speech is an important right, but moderation is needed in order to avoid unexpected serious repercussions. In fact, this is so serious that many countries have passed or are planning legislation that makes platforms responsible for their content, e.g., the Online Harms Bill in the UK and the Digital Services Act in the EU. Even in the USA, content moderation, or the lack thereof, can have a significant impact on businesses (e.g., Parler was denied server space), on governments (e.g., the U.S. Capitol riots), and on individuals (e.g., hate speech is linked to self-harm). As human moderators cannot cope with the volume, there is a need for automatic systems that can assist them. There have been several areas of research in the detection of offensive language (Basile et al., 2019), covering overlapping characteristics such as toxicity, hate speech, cyberbullying, and cyber-aggression. Moreover, it was proposed to use a hierarchical approach to analyze different aspects, such as the type and the target of the offense, which helps provide explainability. The Offensive Language Identification Dataset, or OLID (Zampieri et al., 2019a), is one such example, and it has been widely used in research. OLID contains 14,100 English tweets, which were manually annotated using a three-level taxonomy:

Level A: Offensive Language Detection
Level B: Categorization of Offensive Language
Level C: Offensive Language Target Identification

This taxonomy makes it possible to represent different kinds of offensive content as a function of the type and the target. For example, offensive messages targeting a group are likely to be hate speech, whereas those targeting an individual are probably cyberbullying. The taxonomy was also used for languages such as Arabic (Mubarak et al., 2021) and Greek (Pitenis et al., 2020), allowing for multilingual learning and analysis.
An inherent feature of the hierarchical annotation is that the lower levels of the taxonomy contain a subset of the instances in the higher levels, and thus there are fewer instances in the categories in each subsequent level. This makes it very difficult to train robust deep learning models on such datasets. Moreover, due to the natural infrequency of offensive language (e.g., less than 3% of the tweets are offensive when selected at random), obtaining offensive content is a costly and time-consuming effort. Here, we address these limitations by proposing a new dataset: Semi-Supervised Offensive Language Identification Dataset (SOLID).

arXiv:2004.14454v2 [cs.CL] 24 Sep 2021
Our contributions are as follows: 1. We are the first to apply a semi-supervised method for collecting new offensive data using OLID as a seed dataset, thus avoiding the need for time-consuming annotation.
2. We create and publicly release SOLID, a training dataset containing nine million English tweets for offensive language identification, the largest dataset for this task. SOLID is the official dataset of the SemEval shared task OffensEval-2020.
3. We demonstrate sizable improvements over prior work on the middle and the lower levels of the taxonomy, where gold training data is scarce, when training on SOLID and testing on OLID.
4. We provide a new larger test set and a comprehensive analysis of EASY (i.e., simple explicit tweets such as using curse words) and HARD (i.e., more implicit tweets that use underhanded comments or racial slurs) examples of offensive tweets.
The remainder of this paper is organized as follows: Section 2 presents related studies in aggression identification, cyberbullying detection, and other related tasks. Section 3 describes the OLID dataset and the annotation taxonomy. Section 4 introduces our computational models. Section 5 presents the SOLID dataset. Section 6 discusses the experimental results and Section 6.3 offers additional discussion and analysis. Finally, Section 7 concludes and discusses possible directions for future work.

Related Work
There have been several recent studies on offensive language, hate speech, cyberbullying, aggression, and toxic comment detection. See (Nakov et al., 2021) for an overview. Hate speech identification is by far the most studied abusive language detection task (Ousidhoum et al., 2019; Chung et al., 2019; Mathew et al., 2021). One of the most widely used datasets is the one by Davidson et al. (2017), which contains over 24,000 English tweets labeled as non-offensive, hate speech, and profanity. A recent shared task on this topic is HatEval (Basile et al., 2019).
For cyberbullying detection, Xu et al. (2012) used sentiment analysis and topic models to identify relevant topics. Dadvar et al. (2013) and Safi Samghabadi et al. (2020) studied the utility of the conversational context. In particular, Dadvar et al. (2013) used user-related features such as the frequency of profanity in the previous messages. More recent work has addressed the issues of scalable and timely detection of cyberbullying in online social networks. To this end, Rafiq et al. (2018) used a dynamic priority scheduler, and Yao et al. (2019) proposed sequential hypothesis testing. Safi Samghabadi et al. (2020) constructed a dataset of cyberbullying episodes from the semi-anonymous social network ask.fm.
There were two editions of the TRAC shared task on Aggression Identification (Kumar et al., 2018, 2020), which provided participants with datasets containing annotated Facebook posts and comments in English and Hindi for training and validation. Then, Facebook and Twitter datasets were used for testing. The goal was to discriminate between three classes: non-aggressive, covertly aggressive, and overtly aggressive. Two other shared tasks addressed toxic language. The Toxic Comment Classification Challenge at Kaggle provided participants with comments from Wikipedia annotated using six labels: toxic, severe toxic, obscene, threat, insult, and identity hate. The recent SemEval-2021 Toxic Spans Detection shared task addressed the identification of the token spans that made a post toxic (Pavlopoulos et al., 2021).
In this paper, we extend the prior work of the OLID dataset (Zampieri et al., 2019a). It is annotated using a hierarchical annotation schema as in (Basile et al., 2019; Mandl et al., 2019). In contrast to prior approaches, it takes both the target and the type of offensive content into account. This allows multiple types of offensive content (e.g., hate speech and cyberbullying) to be represented in OLID's taxonomy. Here, we create a large-scale semi-supervised dataset using the same annotation taxonomy as in OLID.

The OLID Dataset
The OLID (Zampieri et al., 2019a) dataset tackles the challenge of detecting offensive language using a labeling schema that classifies each example using the following three-level hierarchy:

Level A: Offensive Language Detection
Is the text offensive?
OFF Inappropriate language, insults, or threats.
NOT Neither offensive nor profane.

Level B: Categorization of Offensive Language
Is the offensive text targeted?
TIN Targeted insult or threat towards a group or an individual.
UNT Untargeted profanity and swearing.
Level C: Offensive Language Target Identification
What is the target of the offense?
IND The target is an individual explicitly or implicitly mentioned in the conversation.
GRP Hate speech targeting a group of people based on ethnicity, gender, sexual orientation, religion, or another common characteristic.
OTH A target that does not fall into the previous categories, e.g., organizations, events, and issues.
The taxonomy was successfully adopted for several languages (Mubarak et al., 2021;Pitenis et al., 2020;Sigurbergsson and Derczynski, 2020;Çöltekin, 2020), and it was used in a series of shared tasks (Zampieri et al., 2019b;Mandl et al., 2019). Tweets from the OLID dataset labeled with the taxonomy are shown in Table 1. The OLID dataset consists of 13,241 training and 860 test tweets. Table 2 presents detailed statistics about the distribution of the labels. There is a substantial class imbalance at each level of the annotation, especially at Level B. Moreover, there is a sizable difference in the total number of annotations between the levels due to the schema, e.g., Level C is 30% smaller than Level A, and the data sizes for B and C are rather small. These drawbacks indicate the need to create a larger dataset.

Models
In this section, we describe the models used for semi-supervised annotation and for evaluating the contribution of SOLID for offensive language identification. We use a suite of heterogeneous machine learning models: PMI (Turney and Littman, 2003), FastText (Joulin et al., 2017), LSTM (Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2019), which have diverse inductive biases. This is an essential prerequisite for our semi-supervised setup (see Section 4.5), as we assume that an ensemble of models with different inductive biases would decrease each individual model's bias.

PMI
We use a PMI-based model that computes the n-gram-based similarity of a tweet to the tweets of a particular class c in the training dataset. The model is considered naïve as it accounts only for the n-gram frequencies in the discrete token space and only in the context of n neighboring tokens. We compute the PMI score (Turney and Littman, 2003) of each n-gram in the training set w.r.t. each class:

PMI(w_i, c_j) = log2 [ p(w_i, c_j) / (p(w_i) * p(c_j)) ]

where p(w_i, c_j) is the frequency of n-gram w_i in instances of class c_j, p(w_i) is the frequency of n-gram w_i in instances from the entire training dataset, and p(c_j) is the frequency of class c_j. Additionally, we find that semantically oriented PMI scores (Turney and Littman, 2003) improve the performance of this naïve method:

PMI-SO(w_i, c_j) = PMI(w_i, c_j) - PMI(w_i, C \ {c_j})

where C \ {c_j} is the set of all classes except c_j. At training time, we collect the frequencies of the n-grams on the training set. At inference time, we use the frequencies to calculate PMI and PMI-SO scores for each unigram and bigram in each instance, and then we average PMI and PMI-SO into a single score for each instance and class. Finally, we select the class with the highest score. If the instance contains no words with associated scores, we choose NOT for Level A, UNT for Level B (i.e., the classes most likely to contain neutral orientation), and the majority class IND for Level C. We remove words appearing fewer than five times in the training set, and we add a smoothing of 0.01 to each frequency.
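The scoring procedure above can be sketched as follows. This is an illustrative sketch rather than the released implementation: it uses unigrams only (the paper also uses bigrams), and all function and variable names are our own.

```python
import math
from collections import Counter

def train_pmi(docs, labels, min_count=5, smoothing=0.01):
    """Collect frequencies and return a PMI + PMI-SO classifier."""
    word_class, word, cls = Counter(), Counter(), Counter()
    for toks, c in zip(docs, labels):
        cls[c] += 1
        for w in toks:
            word[w] += 1
            word_class[(w, c)] += 1
    vocab = {w for w, n in word.items() if n >= min_count}
    total = sum(cls.values())
    classes = sorted(cls)

    def pmi(w, c):
        p_wc = (word_class[(w, c)] + smoothing) / total
        p_w = (word[w] + smoothing) / total
        p_c = (cls[c] + smoothing) / total
        return math.log2(p_wc / (p_w * p_c))

    def pmi_so(w, c):
        # PMI-SO(w, c) = PMI(w, c) - PMI(w, C \ {c})
        others = [o for o in classes if o != c]
        p_wo = (sum(word_class[(w, o)] for o in others) + smoothing) / total
        p_w = (word[w] + smoothing) / total
        p_o = (sum(cls[o] for o in others) + smoothing) / total
        return pmi(w, c) - math.log2(p_wo / (p_w * p_o))

    def predict(toks, fallback):
        """Average PMI and PMI-SO over known tokens; return the fallback
        class if no token has an associated score."""
        toks = [w for w in toks if w in vocab]
        if not toks:
            return fallback
        return max(classes,
                   key=lambda c: sum((pmi(w, c) + pmi_so(w, c)) / 2
                                     for w in toks))

    return predict
```

At prediction time, the per-level fallback class (e.g., NOT for Level A) is passed in explicitly.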

FastText
A suitable extension to the word-based model is to use subword representations to overcome the naturally noisy structure of tweets. FastText (Joulin et al., 2017) is a subword model, which has shown strong performance on various tasks without the need for extensive hyper-parameter tuning. It uses a shallow neural model for text classification similar to the continuous bag-of-words model (Mikolov et al., 2013). However, instead of predicting a word based on its neighbors, it predicts the target label based on the sample's words. FastText offers a valuable, diverse modeling representation to the ensemble due to its differences from the simple PMI model and the heavier LSTM and BERT models. We train FastText with bigrams and a learning rate of 0.01 for Levels A and B, and with trigrams and a learning rate of 0.09 for Level C. All tasks use a window size of five and a hierarchical softmax loss.

LSTM
Unlike the above models (PMI and FastText), an LSTM model (Hochreiter and Schmidhuber, 1997) can account for long-distance relations between words. Our LSTM model has an embedding layer, which we initialize with a concatenation of 300-dimensional GloVe embeddings (Pennington et al., 2014) and 300-dimensional FastText Common Crawl embeddings (Grave et al., 2018). This is followed by a dropout layer and a bi-directional LSTM layer with an attention mechanism on top of it. Next, we concatenate the attention mechanism's output with averaged and maximum global poolings over the outputs of the LSTM. The final prediction is produced by a sigmoid layer for Levels A and B, where we have binary classification, and by a softmax layer for Level C, where we have three classes. We train the LSTM model using early stopping with a patience of five epochs, monitoring the validation loss.
In terms of dimensionality, for Level A, we use a hidden size of 128, a dropout rate of 0.3, a batch size of 256, and a learning rate of 0.0002. For Levels B and C, we use a hidden size of 50, a dropout rate of 0.1, a batch size of 32, and a learning rate of 0.0001. Finally, we use the Adam optimizer for training.
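The described architecture can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: random embedding weights stand in for the 600-dimensional GloVe+FastText concatenation, and the class and attribute names are our own.

```python
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    """Bi-LSTM with attention plus average/max global pooling, as described."""
    def __init__(self, vocab_size, n_out=1, emb_dim=600, hidden=128, dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # stand-in for GloVe+FastText
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        # attention output + average pooling + max pooling, each of size 2*hidden
        self.out = nn.Linear(6 * hidden, n_out)

    def forward(self, x):                               # x: (batch, seq_len)
        h, _ = self.lstm(self.drop(self.emb(x)))        # (batch, seq_len, 2*hidden)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        attended = (a.unsqueeze(-1) * h).sum(dim=1)     # attention-weighted sum
        pooled = torch.cat([attended, h.mean(dim=1), h.max(dim=1).values], dim=-1)
        return self.out(pooled)  # logits: sigmoid for A/B, softmax (n_out=3) for C
```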

BERT
Recently, the Transformer architecture (Vaswani et al., 2017) has demonstrated state-of-the-art performance for several NLP tasks, offering both high representational power and robustness. Here, we exploit the benefits of transfer learning in a low-resource setting by using the pre-trained BERT model (Devlin et al., 2019), which we fine-tune for our tasks (i.e., classification for each of the three levels of the taxonomy). In our experiments, we use the base uncased BERT model implementation from HuggingFace, which has 12 layers, a hidden size of 768, and 12 attention heads, amounting to 110 million parameters. We then fine-tune the model for 2, 3, and 3 epochs for Levels A, B, and C, respectively. We use learning rates of 0.00002 for Levels A and B, and 0.00004 for Level C. We apply per-class weights to cope with the data imbalance in Level C as follows: IND=1, GRP=2, OTH=10. We use the Adam optimizer and a linear warm-up schedule with a 0.05 warm-up ratio.
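The per-class weighting for Level C amounts to a weighted cross-entropy loss. The stdlib sketch below mirrors how PyTorch's CrossEntropyLoss applies class weights (a weighted mean, normalized by the sum of the gold-label weights); the function name is ours, and in the actual setup the probabilities would come from the fine-tuned BERT head.

```python
import math

# Per-class weights for Level C, as described: IND=1, GRP=2, OTH=10.
WEIGHTS = {"IND": 1.0, "GRP": 2.0, "OTH": 10.0}

def weighted_cross_entropy(probs, gold):
    """probs: per-example dicts of predicted class probabilities;
    gold: the gold label per example. Errors on rare classes (OTH)
    are penalized more heavily than errors on frequent ones (IND)."""
    num = sum(-WEIGHTS[y] * math.log(p[y]) for p, y in zip(probs, gold))
    den = sum(WEIGHTS[y] for y in gold)  # PyTorch-style weighted mean
    return num / den
```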

Democratic Co-training
Democratic co-training (Zhou and Goldman, 2004) is a semi-supervised technique, commonly used to create large datasets with noisy labels when provided with a set of diverse models trained in a supervised way. It has been successfully applied in tasks like time series prediction with missing data (Mohamed et al., 2007), early prognosis of academic performance (Kostopoulos et al., 2019), as well as for tasks in the health domain (Longstaff et al., 2010). In our case, we use models with diverse inductive biases to label the target tweet, which can help ameliorate the individual model biases, thus yielding predictions with a lower degree of noise.
In particular, we use democratic co-training to generate semi-supervised labels for all three levels of the SOLID dataset, using OLID as a seed dataset and applying distant supervision with an ensemble of the above-described models; the aggregation and filtering steps are detailed in Section 5.

Table 2: Statistics about the training and the testing data distribution for the OLID and the SOLID datasets.

The SOLID Dataset
In this section, we describe the process of collecting and annotating data for SOLID. We collected a large set of over 12 million tweets, and we labeled nine million of them using the democratic co-training approach described in the previous section. Table 2 shows some statistics about the resulting dataset for each level of the taxonomy.

A Large-Scale Dataset of Tweets
We collected our data in 2019 from Twitter using the Twitter streaming API and Twython. We searched the API using the twenty most common English stopwords (e.g., the, of, and, to) to ensure truly random tweets and to avoid rate limits imposed by the Twitter platform. Using stopwords ensured that we were more likely to obtain English tweets as well as a diverse set of random tweets. We kept the stream tweet collection running the entire time, and we continuously chose a stopword at random based on its frequency in Project Gutenberg, a sizeable monolingual corpus. For each query, we collected 1,000 tweets per stopword. Thus, frequent stopwords were used more often to collect tweets. A full list of the stopwords and their frequencies is shown in Appendix A.1. We used this data collection approach to mitigate the biases found in OLID. OLID was collected using a predefined list of keywords that were more likely to retrieve offensive tweets, which caused offensive tweets in OLID to be explicit and easier to classify. In contrast, the tweets we collected for SOLID contain both implicit and explicit offensive text. This allows us to study the performance of various models in hard classification cases.

Table 3: Macro-F1 score of the models in the democratic co-training ensemble on the OLID test set.
We used the langdetect tool to select English tweets, and we discarded tweets with fewer than 18 characters or fewer than two words. We substituted all user mentions with @USER for anonymization purposes. We also ignored tweets containing URLs, as such tweets tended not to be offensive and might be less self-contained, e.g., they could link to an article, an image, a video, etc. Understanding such tweets would require going beyond their purely textual content. In total, we collected over twelve million tweets. We kept nine million as training data, and we created a new test set from a portion of the remaining three million tweets.
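The filtering and anonymization rules above can be sketched as a single function. The language check with langdetect is omitted here, and the function name is ours:

```python
import re

def preprocess(tweet):
    """Apply the SOLID filtering rules: drop tweets with URLs, anonymize
    user mentions, and drop tweets that are too short to be self-contained.
    Returns the cleaned tweet, or None if the tweet should be discarded."""
    if "http://" in tweet or "https://" in tweet:
        return None                          # tweets with URLs are ignored
    text = re.sub(r"@\w+", "@USER", tweet)   # anonymize user mentions
    if len(text) < 18 or len(text.split()) < 2:
        return None                          # fewer than 18 chars or 2 words
    return text
```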

Semi-Supervised Training Dataset
We used the democratic co-training setup described in Subsection 4.5 to create the semi-supervised dataset. We first trained each model on the OLID dataset, using 10% of the training dataset for validation. The performance of the individual models on the OLID dataset is shown in Table 3. We can see that BERT is the best model for Level A, and that PMI performs almost on par with the LSTM model. We believe that this is due to the size of the dataset and to the fact that a simple lexicon of curse words would be highly predictive of the offensive content present in a tweet. The performance of the FastText model is the lowest, trailing by two points.
For Level B, BERT performs best, followed by the LSTM model. The task is more challenging at this level for frequency and n-gram-based approaches such as PMI and FastText.
Finally, the overall performance of the models at Level C decreases further. This is expected, as the size of the dataset becomes smaller and the task becomes one of three-way classification, whereas Levels A and B are two-way. Here, BERT and LSTM outperform FastText and PMI, with BERT being the best model. The decrease in performance at the final level can lead to increased noise in the semi-supervised labels, but we use an ensemble of four models, and we provide the average and the standard deviation of the confidence across the models on each instance to mitigate this. As we show later, these scores can be successfully used to filter out a large amount of noise in the semi-supervised dataset, thus yielding performance improvements.
We computed the aggregated single prediction based on the average and the standard deviation of the confidences predicted by each of the models:

avg(c) = (1 / |M|) * sum_{m in M} conf_m(c)
std(c) = sqrt( (1 / |M|) * sum_{m in M} (conf_m(c) - avg(c))^2 )

where M is the set of models and conf_m(c) is the confidence of model m for class c. In particular, we computed the scores based on the confidences for the positive class at Levels A and B, and on the confidences for IND, GRP, and OTH at Level C. We performed the above aggregation step instead of just using the scores of each model in order to avoid over-fitting to any particular model in the ensemble. This also helps to prevent biases with respect to individual models in future uses of the dataset. Moreover, the standard deviation and the average scores can be used to filter instances that the models disagree on, thus reducing the potential noise in the semi-supervised annotations.
We labeled the dataset in this semi-supervised manner by first assigning a Level A label to all the tweets. Then, we selected the subset of tweets that were likely to be offensive for all models (BERT and LSTM ≥ .5, PMI and FT=OFF) as instances that should be assigned a label for Level B. Finally, for Level C, we chose the tweets that were likely to be TIN at Level B with a standard deviation lower than 0.25. Thus, only the instances that were most likely to be offensive were considered at Levels B and C, and only those that were most likely to be offensive and targeted were considered at Level C. The size and the label distribution across the datasets can be found in Table 2 and examples of tweets along with model prediction confidences can be found in Table 4.
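The cascade that passes only confidently offensive tweets down to Levels B and C can be sketched as follows. The dictionary keys are our own, and the mean-confidence cutoff for "likely TIN" is an assumption, since the text only specifies the standard-deviation threshold of 0.25:

```python
from statistics import mean, pstdev

def level_b_candidates(instances):
    """Tweets passed down to Level B: every model deems them offensive
    (BERT and LSTM confidence >= 0.5, PMI and FastText predict OFF).
    Each instance holds per-model Level A outputs."""
    return [t for t in instances
            if t["bert"] >= 0.5 and t["lstm"] >= 0.5
            and t["pmi"] == "OFF" and t["ft"] == "OFF"]

def level_c_candidates(instances):
    """Tweets passed down to Level C: likely TIN at Level B (here assumed
    to mean average confidence >= 0.5) with model disagreement
    (standard deviation) below 0.25."""
    keep = []
    for t in instances:
        confs = t["tin_confidences"]  # one TIN confidence per model
        if mean(confs) >= 0.5 and pstdev(confs) < 0.25:
            keep.append(t)
    return keep
```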

SOLID Test Dataset
As the OLID test set was very small, particularly for Levels B and C, we also annotated a portion of our held-out three million tweets in order to create a new SOLID test set to obtain more stable results and to analyze the performance of various models in more detail.
First, all co-authors of this paper (five annotators) annotated 48 tweets that were predicted to be OFF in order to measure inter-annotator agreement (IAA), using

P_0 = agreement_per_annotation / (total_annotations * num_annotators)

We found the IAA to be 0.988 for Level A, an almost perfect agreement on OFF/NOT. The IAA for Level B was 0.818, indicating good agreement on whether the offensive tweet was TIN/UNT. Finally, for Level C, the IAA was 0.630, which is lower but still considered reasonable, as Level C is more complicated due to its three-way annotation schema: IND/GRP/OTH. Moreover, while a tweet may address targets of different types (e.g., both an individual and a group), only one label can be chosen for it.
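One way to read the P_0 formula above is as the fraction of annotations that match their item's majority label; a sketch under that assumption (function name ours):

```python
from collections import Counter

def observed_agreement(annotations):
    """annotations: one list of labels per item, one label per annotator.
    For each item, count the annotations that match the majority label,
    then divide by the total number of annotations."""
    agree = total = 0
    for labels in annotations:
        agree += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return agree / total
```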
After having observed this high IAA, we annotated additional offensive tweets with a single annotation per instance. We divided our Level A data into four portions based on model confidence. Note that PMI=OFF and FT=OFF designate that the model's probability is higher for OFF than for NOT. We selected the remaining thresholds after a manual examination of the confidence scores for each model, choosing thresholds at which the model is confident and mostly correct.
We annotated 3,493 tweets for Level A. The number of annotations at each level is shown above in square brackets. Moreover, in order to create a complete test dataset for Level A (where we only labeled offensive tweets), we also took a random set of 2,500 Easy NOT tweets. The resulting test sizes are shown in Table 2. Of the 3,493 annotated tweets, 491 were judged to be NOT. In total, there were 5,993 tweets in our test set. In all cases, we annotated all three levels, but the decision about whether a tweet in Level B/C is Easy or Hard is still based on its Level A confidence. Table 5 shows some tweets and whether they are Easy OFF/NOT (lines 1-4) or Hard OFF/NOT (lines 5-8), and Table 6 shows statistics about the Easy and the Hard examples in the test dataset. Note that determining the labels for the Hard examples is not simple and the model does make incorrect predictions such as in lines 5 and 8 of Table 5. In fact, 25% of the Hard OFF tweets that we annotated were NOT. In contrast, 8% of the Easy OFF tweets were judged to be NOT.

Experiments and Evaluation
Below, we describe our experiments and evaluation results on the OLID test set when training on OLID + SOLID compared to training on OLID only.

Experimental Setup
We used the BERT and the FastText models from the semi-supervised annotation setup to estimate the improvements when training on the supervised dataset OLID together with the semi-supervised SOLID. The models in all sets of experiments were fine-tuned on a 10% validation split of the training set used during co-training. We explored different ways to combine OLID and SOLID, and different thresholds for the confidence of the instances in SOLID. We achieved improvements for Levels B and C by upsampling the underrepresented classes: we sampled K instances of each class, where K is the number of instances for the most frequent class. We also removed the warm-up in Levels B and C, which improved the results further.
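The upsampling step can be sketched as sampling with replacement until every class reaches K, the size of the most frequent class (function name ours):

```python
import random
from collections import defaultdict

def upsample(instances, labels, seed=0):
    """Upsample every class to K instances, where K is the size of the
    most frequent class, by sampling with replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    k = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out += [(x, y) for x in xs]                       # keep originals
        out += [(rng.choice(xs), y) for _ in range(k - len(xs))]  # pad to K
    return out
```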
FastText. The FastText model is implemented as an external command-line tool, which does not give us much control over training. Thus, we trained models on the combined training sets of OLID and SOLID. The FastText model had the same parameters used above in co-training.

BERT. Due to the computational requirements of BERT, we subsampled 20,000 tweets from SOLID for Levels A and B; in fact, using more instances did not help. During training, we used SOLID in the first epoch and OLID in the following two epochs for Level A. Using SOLID after training with OLID yielded worse results, which is probably because the semi-supervised dataset by construction contains somewhat noisy labels. Yet, it can be used as an initial step to guide the model towards a better local minimum. On the other hand, we conjecture that the supervised dataset is better suited for fine-tuning the model towards the local minimum with the gold data, particularly for Level A, where the training split of OLID is already sufficient for training BERT. For Levels B and C, we trained for two epochs with the training split of OLID and then for one epoch with SOLID. At Levels B and C, we observed that training with SOLID in the first epochs and then fine-tuning with OLID did not improve the performance. Moreover, training with OLID and then using SOLID for the final epochs yielded substantial performance improvements. We assume this is due to the small training size of OLID, which can cause the model to overfit to a suboptimal local minimum when used in the final training epochs.
Selecting SOLID Instances. When training with FastText and BERT, we filtered the training instances from SOLID, keeping the most confident examples based on the average probability score provided in SOLID. We chose the thresholds for the average confidence score based on the validation dataset, selecting the labels as follows: for Level A, NOT when avg(OFF) < 0.20, else OFF; for Level B, UNT when avg(UNT) > 0.65, else TIN; for Level C, the class with the highest probability.
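The label-selection rules just described are simple thresholds on the averaged model confidences (function names ours):

```python
def level_a_label(avg_off):
    """NOT when avg(OFF) < 0.20, else OFF."""
    return "NOT" if avg_off < 0.20 else "OFF"

def level_b_label(avg_unt):
    """UNT when avg(UNT) > 0.65, else TIN."""
    return "UNT" if avg_unt > 0.65 else "TIN"

def level_c_label(avg_probs):
    """The class with the highest averaged probability,
    e.g., avg_probs = {"IND": 0.2, "GRP": 0.7, "OTH": 0.1}."""
    return max(avg_probs, key=avg_probs.get)
```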

OLID Results
In this section, we describe our results when testing on the OLID test set. We compare training on OLID vs. training on OLID + SOLID. The results are shown in Table 7.
We can see that for Level A, when training with OLID+SOLID, the results improve for FastText, which is a weak model (see also Table 3). However, for BERT, which already performs very strongly when fine-tuned with OLID only, there is not much difference when SOLID is added; in fact, there is even a small degradation in performance. These results are in line with findings in previous work (Longstaff et al., 2010), where it was observed that democratic co-training performs better when the initial classifier accuracy is low.
For Level B, the OLID training dataset is smaller, and the task is more complex. Thus, there is more benefit in adding SOLID, which yields sizable improvements for both BERT and FastText. Yet, as FastText is a much weaker model (in fact, performing the same as the majority class baseline when trained on OLID only), the absolute gain for it is much larger than for BERT: 12.1 vs. 4.2 macro-F1 points absolute.
Finally, for Level C, the manually annotated OLID dataset is even smaller, and the number of classes increases from two to three. As a result, BERT benefits from adding the SOLID data by a large margin of 5.4 macro-F1 points absolute. However, using SOLID for FastText does not help. This might be due to FastText already achieving high performance when trained with OLID only (see Table 3), which is on par with that of BERT, while democratic co-training performs well when the initial classifier's performance is low.

SOLID Results
Above, we have demonstrated sizable improvements when training on a combination of the OLID and the SOLID datasets and testing on the test partition of OLID. However, OLID is small, and thus the results could be unstable, especially for Levels B and C. Thus, evaluating on a larger set, namely the test set of SOLID, is important for estimating model stability. We also focus on Easy vs. Hard examples (based on the confidence computed during co-training) to gain better insight into why some tweets are easier to classify as offensive than others. The results are shown in Table 8; they beat the majority-class baselines by a huge margin. We can see that the results for Level A on the SOLID test set are 0.923 and 0.860 macro-F1 for BERT and for FastText, respectively, with a small improvement when OLID is augmented with SOLID for FastText only. This is consistent with what we found on the OLID test set. Note that the full results for Level A are much better than those on the OLID test set in Table 7. We believe that this is partially due to our selection of tweets for the new test set, indicating that there are more Easy tweets in it. Similar findings to the full test set occur with the Easy tweets, but the scores this time are even higher. On the other hand, for the Hard tweets, the results are much lower: 0.570 and 0.536 for BERT and for FastText, respectively. Overall, using SOLID yields a clear improvement for both models on the Hard tweets, which was not evident on the OLID test set in Table 7.
In order to gain further insight into why the results are so high for Easy OFF tweets at Level A, we implemented a curse-word baseline using the absence and the presence of 22 curse words (the list can be found in Appendix A.1). We found that most Easy tweets were classified correctly by this baseline with an F1-score of 0.936. In contrast, the curse-word baseline was not effective on the hard examples, just like the BERT and the FastText models were not. It achieved a macro-F1 score of 0.580, which is one point higher than the BERT result. Thus, we can conclude that both BERT and FastText are probably overfitting to the curse words to some extent. The Hard tweets are offensive due to other language use such as negative biases rather than the appearance of a curse word such as in examples 6 and 8 in Table 5. Classifying such tweets correctly remains an open challenge not only for our models, but also in general.
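The curse-word baseline amounts to a single set intersection; the sketch below uses a placeholder word list rather than the actual 22-word list from Appendix A.1, and the function name is ours:

```python
def curse_word_baseline(tweet, curse_words):
    """Label a tweet OFF if any of its tokens is a curse word, else NOT.
    `curse_words` stands in for the 22-word list in Appendix A.1."""
    tokens = set(tweet.lower().split())
    return "OFF" if tokens & set(curse_words) else "NOT"
```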
The difference between Easy OFF/NOT and Hard OFF/NOT tweets is less pronounced for Levels B and C. The curse word imbalance may have a small impact on the lower levels as UNT tweets are more likely to contain curse words. In all cases, combining SOLID and OLID for Levels B and C yields a sizable improvement, indicating that the larger test set can better showcase the differences, leading to more stability. The results for Levels B and C vary greatly for the two models compared to those on the OLID test set in Table 7, which points to the challenges of having a small test set.

Conclusion and Future Work
We have presented SOLID, a large-scale semi-supervised training dataset for offensive language identification, which we created using an ensemble of four different models. To the best of our knowledge, SOLID is the largest dataset of its kind, containing nine million English tweets. We have shown that using SOLID yields noticeable performance improvements for Levels B and C of the OLID annotation schema, as evaluated on the OLID test set. Moreover, in contrast to using keywords, our approach allows us to distinguish between Hard and Easy offensive tweets. The latter enables us to have a deeper understanding of offensive language identification and indicates that detecting Hard offensive tweets is still an open challenge. Our work encourages safe and positive places on the web that are free of offensive content, especially non-obvious cases, i.e., Hard ones. SOLID was the official dataset of the SemEval shared task OffensEval-2020.
In the future, we would like to provide insights and methods for categorizing Hard tweets.

Ethics Statement
Dataset Collection We collected both the OLID and the SOLID datasets using the Twitter API. The OLID dataset was collected using keywords that would be more likely to be accompanied by offensive tweets (Zampieri et al., 2019a), while the SOLID dataset was collected by querying with frequent stop words (see Section 5). Overall, we followed the terms of use outlined by Twitter. Specifically, we only downloaded public tweets, and we provided only the user ids of those tweets in order to ensure that deleted tweets will no longer be part of our dataset. Moreover, in all our examples in this paper, we anonymized the user names in the tweets. Since no private information is stored, IRB approval is not required. All annotations were performed internally by the authors of the paper.
Biases We note that determining whether a piece of text is offensive can be subjective, and thus it is inevitable that there would be biases in our gold labeled data. It is expected that such biases will, therefore, also be present in the semi-supervised dataset we generated from such tweets.
While we cannot ensure that no biases occur in the gold data, we addressed these concerns by following a well-defined schema, which sets explicit definitions for offensive content during annotation. Our high inter-annotator agreement makes us confident that the assignment of the schema to the data is correct most of the time.
Using semi-supervised techniques to create a large dataset, SOLID, can cause the biases found in the gold data to be expanded further. We mitigated this in two ways. First, we gathered tweets based on the most frequent words in English to ensure a random set of initial tweets. Next, we constructed an ensemble of models with diverse inductive biases to label the target tweet, which can help to ameliorate the individual model biases and to produce predictions with a lower degree of noise. At test time, we aimed to have a meaningful ratio of offensive and non-offensive tweets based on a random collection of tweets. We also labeled all testing offensive tweets manually. The aim of these steps was to help reduce the potential biases. Please refer to Section A.2 of the Appendix for some analysis that shows the diversity of the models.
We acknowledge that current semi-supervised techniques do not address the problem of potential biases, which is inherent in the semi-supervised data coming from the supervised source model(s), which can also be studied in future work. We further acknowledge that biases can still exist in the ratio of offensive to non-offensive tweets in our dataset. In general, the size and the method of collection for the SOLID dataset mean that biases are hard to avoid.
Moreover, offensive language can vary depending on demographics, such as the gender of the targeted individual; the target could even be a particular gender group. Such biases, which are present in natural language data (Olteanu et al., 2019), are an important direction for future work.
Misuse Potential Most datasets compiled from social media present some risk of misuse. We therefore ask researchers to be aware that the SOLID dataset can be maliciously used to unfairly moderate text (e.g., a tweet) that may not be offensive based on biases that may or may not be related to demographic and/or other information present within the text. Intervention by human moderators would be required in order to ensure that this does not occur.

Intended Use
We have presented the SOLID dataset with the aim to encourage research in automatically detecting and stopping offensive content from being disseminated on the web. Such systems can be used to alleviate the burden on media moderators, who can suffer psychological harm due to exposure to extremely offensive content. Improving the performance of offensive content detection systems can decrease the amount of work for human moderators, but human supervision would still be required for more intricate cases and in order to ensure that the system is not causing harm. Given the possible ramifications of a highly subjective dataset, we distribute SOLID for research purposes only, without a license for commercial use. Any biases found in the dataset are unintentional, and we do not intend to cause harm to any group or individual.
We believe that this dataset is a useful resource when used in the appropriate manner and that it has great potential to improve the performance of current offensive content detection and automatic content moderation systems.

A Appendix
Below, we provide additional details about the data collection, we perform analysis, and we give some implementation details.

A.1 Data Collection and Analysis
In Section 5.1, we described our method for collecting tweets. We queried the Twitter API using the most frequent English words based on the large monolingual Project Gutenberg corpus. Table 9 shows the top-20 most frequent words in the corpus and their frequency, which we used to collect the tweets. The normalized value is the cumulative percentage of the total frequency covered by the first N most frequent words. To choose a word, we generate a random number between 0 and 1, and we select the word whose normalized cumulative frequency is the smallest value higher than the generated number. For example, 0.45 would correspond to the word to. In Section 6.3, we discussed the simple curse-word baseline used to analyze the Easy OFF/NOT tweets. Table 10 gives the list of the 22 curse words that we used in that baseline: ass, arse, wtf, lmao, fuck, bitch, nigga, nigger, cunt, effing, shit, hell, damn, crap, bastard, idiot, stupid, racist, dumb, f*ck, pussy, dick.
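The sampling procedure above amounts to drawing from the cumulative frequency distribution of the word list. The following sketch illustrates it; the word list and the cumulative values are illustrative placeholders, not the actual Table 9 numbers, though they are chosen so that a draw of 0.45 maps to the word "to", as in the example above.

```python
import bisect

# Illustrative words and cumulative normalized frequencies (NOT the actual
# Table 9 values): CUMULATIVE[i] is the share of total frequency covered
# by the first i+1 words.
WORDS = ["the", "of", "and", "to", "a"]
CUMULATIVE = [0.18, 0.28, 0.38, 0.47, 0.55]

def word_for_draw(r: float) -> str:
    """Select the word whose cumulative normalized frequency is the
    smallest value strictly greater than the drawn number r.
    Assumes 0 <= r < CUMULATIVE[-1]."""
    return WORDS[bisect.bisect_right(CUMULATIVE, r)]
```

Drawing r uniformly and mapping it this way samples each word in proportion to its corpus frequency, so frequent stop words are queried most often.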

A.2 Implementation Details
We fine-tuned the models on 10% of the OLID dataset. All models were trained on an NVIDIA Titan X GPU with 8GB of memory. The performance of the individual models in our ensemble for semi-supervised labelling is shown in Table 11. The evaluation measure we used for all experiments is the macro-F1 score, as implemented in scikit-learn.

Table 11: Macro-F1 score, on the validation set, for the models used in the ensemble for Levels A, B, and C.
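For reference, macro-F1 averages the per-class F1 scores with equal weight per class, regardless of class frequency; this is the definition implemented by scikit-learn's `f1_score(average='macro')`. A self-contained sketch of the computation (with made-up labels for illustration):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with
    equal class weight (matching scikit-learn's f1_score(average='macro'))."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because each class contributes equally, macro-F1 penalizes a model that performs well only on the majority class, which matters here given the OFF/NOT imbalance.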

In Table 12, we show the agreement between the models for the task prediction. For Levels A and B, it is more common that all four models agree, while for Level C, there are more cases in which at least one model disagrees with the rest. Moreover, at Level A, there are almost no ties, i.e., cases in which two models disagree with the other two. Finally, as the performance of the models is lowest at Level C, the disagreement between the models in the ensemble is largest there, and it is least common for all four models to agree on a prediction. Given the observed agreement rates, we conclude that there is considerable variance in the predictions across the models, especially for the lower levels of the taxonomy. This indicates that the individual models differ in their predictions, and these differences can be resolved by the ensemble combination in the democratic training setup.
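The agreement statistics in Table 12 can be computed by counting, for each tweet, how many of the four models back the majority label. The sketch below shows this counting step only; it is a simplified illustration over hard labels, whereas the actual SOLID construction aggregates model confidences.

```python
from collections import Counter

def agreement_pattern(preds):
    """Given the four models' hard predictions for one tweet, return the
    majority label and how many models agree on it (4 = full agreement,
    2 = a tie, in which two models disagree with the other two)."""
    counts = Counter(preds)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count
```

Tabulating `majority_count` over all tweets per level reproduces the agreement breakdown: Levels A and B are dominated by counts of 4, while Level C shows more counts of 3 and 2.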