Toxic Comment Collection: Making More Than 30 Datasets Easily Accessible in One Unified Format

With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large number of heterogeneous datasets that can be used for training and testing in very specific settings. Despite recent efforts to create web pages that provide an overview, most publications still use only a single dataset. The datasets are not stored in one central database, they come in many different data formats, and it is difficult to interpret their class labels and to reuse these labels in other projects. To overcome these issues, we present a collection of more than thirty datasets in the form of a software tool that automates downloading and processing of the data and presents them in a unified data format that also offers a mapping of compatible class labels. Another advantage of the tool is that it gives an overview of the properties of available datasets, such as languages, platforms, and class labels, to make it easier to select suitable training and test data.


Toxic Comment Datasets
Supervised machine learning, and more specifically supervised deep learning, is the current state of the art for text classification in general and for toxic comment classification in particular (van Aken et al., 2018). The performance of these classifiers depends heavily on the size and quality of the available training data, which is mostly used for fine-tuning general language models. The rather small size of annotated toxic comment datasets stems from the high cost of obtaining high-quality labels and the high variety of the task itself. For each language and each specific set of labels (racism, attack, hate, abuse, offense, etc.), new training and test datasets are needed. To circumvent this need, transfer learning can be applied, but only up to a certain degree (Bigoulaeva et al., 2021). As a result, many researchers have created their own training and test datasets customized to their specific use cases. Three recent surveys compare and discuss datasets used in the literature for hate speech and abusive language detection (Madukwe et al., 2020; Poletto et al., 2020; Vidgen and Derczynski, 2020). These overviews help to assess the dataset landscape but stop short of the next step: integrating and unifying the various datasets and making them easily accessible.
In this paper, we present a software tool that provides easy access to many individual toxic comment datasets using a simple API. The datasets are in a unified data format and can be filtered based on metadata. The collection currently contains datasets in thirteen different languages: Arabic, Danish, English, French, German, Greek, Hindi, Indonesian, Italian, Marathi, Portuguese, Slovenian, and Turkish. Further, it covers a wide range of labels for different kinds of toxicity, e.g., sexism, aggression, and hate. The code is available in a GitHub repository 1 and also as a PyPI package 2 so that users can easily install it via the command pip install toxic-comment-collection and import datasets from the collection within Python.
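The idea of filtering datasets by metadata can be illustrated with a minimal sketch. It deliberately does not use the package's actual API, which is not described here; the `DatasetInfo` record and the `filter_datasets` helper are hypothetical names introduced for illustration, and the catalog entries are taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record; not the package's real data model."""
    study: str
    platform: str
    languages: tuple
    labels: tuple


# A few entries as listed in Table 1.
CATALOG = [
    DatasetInfo("Mubarak et al. (2017)", "news", ("ar",), ("obscene", "offensive")),
    DatasetInfo("Mubarak et al. (2017)", "Twitter", ("ar",), ("obscene", "offensive")),
    DatasetInfo("Wulczyn et al. (2017)", "Wikipedia", ("en",), ("attack",)),
    DatasetInfo("Kumar et al. (2018)", "Facebook", ("hing",), ("aggressive",)),
    DatasetInfo("Mathur et al. (2018)", "Twitter", ("en", "hi"), ("abuse", "hate")),
]


def filter_datasets(catalog, language=None, platform=None, label=None):
    """Return the entries that match every criterion that is not None."""
    result = []
    for info in catalog:
        if language is not None and language not in info.languages:
            continue
        if platform is not None and info.platform != platform:
            continue
        if label is not None and label not in info.labels:
            continue
        result.append(info)
    return result


twitter_sets = filter_datasets(CATALOG, platform="Twitter")
arabic_news = filter_datasets(CATALOG, language="ar", platform="news")
```

Combining criteria narrows the selection, so a user could, for example, pick all Arabic news datasets for training and hold out the Arabic Twitter dataset for testing.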
With our tool, researchers can combine different datasets for customized training and testing. Further, it fosters research on toxic comments and the development of robust systems for practical application. Important research and practical questions that can be investigated with our provided tool are:
1. How well do hate speech, toxicity, abusive and offensive language classification models generalize across datasets?

Unified Toxic Comment Collection
Creating a unified collection of toxic comment datasets comes with several challenges. First, the datasets are stored on various platforms and need to be retrieved. Second, the different file formats of the datasets complicate data integration. Third, the different sets of class labels need to be mapped to a common namespace. This section describes how the creation of our collection addresses these three challenges and presents statistics of the collection.

Collection Creation
We consider all publicly accessible comment datasets for the collection that contain labels that are subclasses of toxicity, such as offensive language, abusive language, and aggression. The broad definition of toxicity as a higher-level concept builds a bridge between the different lower-level concepts. The term denotes comments that contain toxic language and was made popular by the Kaggle Challenge on Toxic Comment Classification in 2018, which defined toxic comments as comments that are likely to make a reader leave a discussion. 3 We exclude datasets that consider users instead of comments as the level of annotation (Chatzakou et al., 2017; Ribeiro et al., 2018) or study a different type of conversation, e.g., WhatsApp chats, where the participants presumably know each other in person (Sprugnoli et al., 2018). The datasets that we collected come from various sources, such as GitHub repositories, web pages of universities, or Google Drive and other file storage platforms. Even more diverse than the source platforms are the file formats of the datasets. They range from CSV files with different column separators and quoting characters, over Excel sheets and SQL dumps, to TXT files with single records spanning multiple rows, optionally compressed as ZIP or TAR files. Converting all these formats into the same standardized CSV format of our collection is the second step of the data integration, after the datasets are retrieved.
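The conversion step can be sketched for the simplest case, delimited text files that differ only in separator and quoting character. This is a minimal illustration using Python's standard csv module, not the tool's actual implementation; the function name `to_standard_csv` and the output column names `text` and `labels` are assumptions made for this example.

```python
import csv
import io


def to_standard_csv(raw_text, delimiter, quotechar, text_column, label_column):
    """Re-encode delimited text with a known separator and quoting
    character as a uniform, fully quoted, comma-separated CSV string."""
    reader = csv.DictReader(
        io.StringIO(raw_text), delimiter=delimiter, quotechar=quotechar
    )
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # Assumed target schema: one text column and one label column.
    writer.writerow(["text", "labels"])
    for row in reader:
        writer.writerow([row[text_column], row[label_column]])
    return out.getvalue()


# Two sources with different separators, quoting, and column names.
semicolon_source = 'comment;class\n"get lost; now";offensive\n'
tab_source = "tweet\tlabel\n'fine by me'\tnone\n"

unified_a = to_standard_csv(semicolon_source, ";", '"', "comment", "class")
unified_b = to_standard_csv(tab_source, "\t", "'", "tweet", "label")
```

Passing the separator and quoting character explicitly per dataset keeps the conversion deterministic; formats such as Excel sheets or SQL dumps would need their own readers in front of the same writer.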
The third step focuses on the class labels. These labels are encoded in different ways. In the simplest format, there is a single column that contains one string per row, which is the class label. In some datasets, the class labels are encoded as integers, presumably to reduce file size. For multi-label classification datasets, the column might contain a list of strings or a list of integers. We unify the format of the labels to lists of strings.
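The unification of label encodings can be sketched as follows. The function name `normalize_labels` and the example code book are hypothetical; the sketch only assumes that each integer-encoded dataset comes with some mapping from integer codes to label names.

```python
def normalize_labels(raw, id_to_name=None):
    """Return labels as a list of strings, whatever the source encoding."""
    id_to_name = id_to_name or {}
    # A single label (string or integer) becomes a one-element list.
    values = raw if isinstance(raw, (list, tuple)) else [raw]
    normalized = []
    for value in values:
        if isinstance(value, int):
            # Integer codes are decoded via the dataset's own code book.
            normalized.append(id_to_name[value])
        else:
            normalized.append(str(value))
    return normalized


# Hypothetical code book of an integer-encoded dataset.
codes = {0: "none", 1: "offensive"}

single_string = normalize_labels("hate")
single_int = normalize_labels(1, codes)
string_list = normalize_labels(["abuse", "hate"])
int_list = normalize_labels([0, 1], codes)
```

All four input encodings end up in the same shape, so downstream code never has to distinguish single-label from multi-label datasets.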
More importantly, we create a mapping of class labels so that labels with the same meaning but different names are replaced with the same label. This mapping is stored in a configuration file and can be customized by users. Different use cases require different mappings. For example, one mapping can be used to map all datasets in the collection to a binary classification task of toxic and non-toxic comments. The next section describes the effect of this mapping on the toxic comment collection, along with other statistics of the collection.
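A configurable label mapping of this kind might look as follows. The JSON layout and the two mapping entries are assumptions chosen for illustration; the format and content of the tool's real configuration file may differ.

```python
import json

# Hypothetical configuration: source label name -> unified label name.
CONFIG = """
{
  "offense": "offensive",
  "abuse": "abusive"
}
"""


def apply_mapping(labels, mapping):
    """Replace each label by its mapped name; unmapped labels are kept."""
    mapped = []
    for label in labels:
        target = mapping.get(label, label)
        if target not in mapped:  # avoid duplicates after merging synonyms
            mapped.append(target)
    return mapped


mapping = json.loads(CONFIG)
merged = apply_mapping(["abuse", "offense"], mapping)
deduplicated = apply_mapping(["offensive", "offense"], mapping)
untouched = apply_mapping(["hate"], mapping)
```

Because the mapping is plain data, a user can swap in a different file for a different use case without touching the code that applies it.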

Collection Statistics
The collection contains comments in thirteen different languages, from twelve platforms, and with 162 distinct class labels (before mapping them to a smaller set of class labels). There is a large set of labels that occurs only in one dataset, with each label referring to a particular subclass of toxicity and target, e.g., female football players as in the dataset by Fortuna et al. (2019).
After combining similar names through our mapping strategy, 126 class labels remain, with 57 of them occurring in more than 100 samples. The total number of samples is currently 812,993. We are constantly adding more datasets.
As described in the previous section, a mapping can also be used to create a binary view on the collection with only two class labels: toxic and nontoxic. To this end, the class labels none (471,871 comments), normal (37,922 comments), other (2,248 comments), positive (4,038 comments), and appropriate (2,997 comments) are mapped to nontoxic (519,076 comments). The labels idk/skip (73 comments) are discarded and all other labels are mapped to toxic (293,844 comments).
Table 1 gives an overview of the collection by listing all datasets currently included in the collection together with their number of samples, source platform, language, and class labels. The table reveals that Twitter is the primary data source and that there is no common set of class labels. As per Twitter's content redistribution policy, 4 the tweets themselves were (in almost all cases) not released by the researchers but only the tweet IDs. These IDs allow re-collecting the dataset via the Twitter API. Our tool automates this process, which is also called re-hydration.
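The binary view and the reported counts can be checked with a short sketch. The function name `to_binary` is hypothetical, and treating idk/skip as a single label string is an assumption; the counts are those given above.

```python
NONTOXIC_SOURCES = {"none", "normal", "other", "positive", "appropriate"}
DISCARDED = {"idk/skip"}  # assumption: stored as one label string


def to_binary(label):
    """Map a unified label to 'toxic' or 'nontoxic'; None means discard."""
    if label in DISCARDED:
        return None
    return "nontoxic" if label in NONTOXIC_SOURCES else "toxic"


# Comment counts per label as reported in the text.
nontoxic_counts = {
    "none": 471_871,
    "normal": 37_922,
    "other": 2_248,
    "positive": 4_038,
    "appropriate": 2_997,
}
nontoxic_total = sum(nontoxic_counts.values())  # 519076
toxic_total = 293_844
discarded_total = 73
collection_total = nontoxic_total + toxic_total + discarded_total  # 812993
```

The three groups add up to the 812,993 samples reported for the whole collection, so the binary view accounts for every comment.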
A challenge that is not visible in Table 1 is the inherent class imbalance of many datasets. For example, the class distribution of the dataset of attacking comments by Wulczyn et al. (2017) exhibits a bias towards "clean" comments (201,081 clean; 21,384 attack), whereas the dataset by Davidson et al. (2017) exhibits a bias towards "offensive" comments (19,190 offensive; 4,163 clean). The latter class distribution is not representative of the underlying data in general. It is due to biased sampling, similar to the issues that apply to the dataset by Zhang et al. (2018). Zhang et al. (2018) collected their dataset via the Twitter API by filtering for a list of keywords, e.g., muslim, refugee, terrorist, and attack, or hashtags, such as #banislam, #refugeesnotwelcome, and #DeportallMuslims. This step introduces a strong bias because all hateful tweets in the created dataset contain at least one of the keywords or hashtags. Thus, the data is not a representative sample of all hateful tweets on Twitter, and models trained on that data might overfit to the list of keywords and hashtags. However, the advantage of this step is that it reduces the annotation effort: fewer annotations are required to create a larger set of hateful tweets. In fact, most comment platforms contain only a tiny percentage of toxic comments. Since research datasets are collected with a focus on toxic comments, they can be biased in a significant way. This focused data collection creates unrealistic evaluation scenarios and needs to be taken into account when deploying models trained on these datasets in real-world scenarios.
Figure 1 visualizes the overlap of the sets of class labels used in the different datasets contained in the toxic comment collection. On the one hand, there are hardly any pairs of datasets with the exact same set of labels (yellow cells). Exceptions are datasets by the same authors.
On the other hand, there are also only a few pairs of datasets with no common class label at all.
Figure 1: Yellow cell color means that all class labels contained in the dataset of that row are also contained in the dataset of that column. See IDs in Table 1 for dataset names.
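The kind of label overlap shown in Figure 1 can be computed as in the following sketch. It uses the class labels of five datasets as listed in Table 1 (before any mapping). The function name `overlap_matrix` is hypothetical, and the use of a fraction for cells that are not fully covered is an assumption; the text only defines the yellow cells.

```python
# Class labels per dataset ID, as listed in Table 1.
LABEL_SETS = {
    10: {"obscene", "offensive"},
    11: {"obscene", "offensive"},
    12: {"attack"},
    13: {"aggressive"},
    21: {"abuse", "hate"},
}


def overlap_matrix(label_sets):
    """For each (row, column) pair, compute the fraction of the row
    dataset's labels that also occur in the column dataset."""
    ids = sorted(label_sets)
    return {
        row: {
            col: len(label_sets[row] & label_sets[col]) / len(label_sets[row])
            for col in ids
        }
        for row in ids
    }


matrix = overlap_matrix(LABEL_SETS)

# Pairs of distinct datasets where the row's labels are fully covered by
# the column's labels (the "yellow cells" in the figure's terms).
full_pairs = [
    (row, col)
    for row in matrix
    for col in matrix[row]
    if row != col and matrix[row][col] == 1.0
]
```

In this small excerpt, the only fully overlapping pair consists of datasets 10 and 11, which come from the same authors, in line with the observation above.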

Conclusions and Future Work
In this paper, we addressed three challenges that hinder accessibility of research datasets of toxic comments: retrieving the datasets, unifying their file formats, and mapping their class labels to a common subset. To overcome these challenges, we present the toxic comment collection, which does not contain the datasets themselves, but code that automatically fetches these datasets from their source and transforms them into a unified format. Its advantages are the easy access to a large number of datasets and the option to filter by language, platform, and class label. With the toxic comment collection, we aim to foster repeatability and reproducibility of research on toxic comments and to allow research on multilingual toxic comment classification by combining multiple datasets. We are continuously adding more datasets to the collection with routines to download them and to standardize their format automatically, e.g., we plan to integrate the datasets by Kumar et al. (2018) and Zampieri et al. (2019) next. We also plan to add contact information and instructions for datasets that are not publicly accessible but available only on request, such as the datasets by Golbeck et al. (2017), Rezvan et al. (2018), and Tulkens et al. (2016).

Table 1 (partial): Datasets in the collection with number of samples, source platform, language, and class labels.
ID  Study                      Size    Platform   Language  Class labels
…   … (2017)                   1.5k    news       en        hate
9   Jha and Mamidi (2017)      10.0k   Twitter    en        benevolent/hostile sexism
10  Mubarak et al. (2017)      31.7k   news       ar        obscene, offensive
11  Mubarak et al. (2017)      1.1k    Twitter    ar        obscene, offensive
12  Wulczyn et al. (2017)      115.9k  Wikipedia  en        attack
13  Wulczyn et al. (2017)      115.9k  Wikipedia  en        aggressive
14  Wulczyn et al. (2017)      160.0k  Wikipedia  en        toxic
15  Albadi et al. (2018)       6.…
…
…   … (2018)                   2.0k    Twitter    id        abuse, offense
20  Kumar et al. (2018)        11.6k   Facebook   hing      aggressive
21  Mathur et al. (2018)       3.2k    Twitter    en, hi    abuse, hate
22  Sanguinetti et al. (2018)  6.9k    Twitter    it        five classes b
23  Wiegand et al. (2018)      8.…