UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification

We present our work on SemEval-2021 Task 5: Toxic Spans Detection. This task aims to build a model that identifies the toxic words within whole posts. We use a BiLSTM-CRF model combined with ToxicBERT classification to train a detection model that identifies toxic words in posts. Our model achieves an F1-score of 62.23% on the Toxic Spans Detection task.


Introduction
Detecting toxic posts on social network sites is a crucial task for social media moderators in order to keep a clean and friendly space for online discussion. To identify whether a comment or post is toxic, social network administrators often read the entire text. However, with a large number of lengthy posts, administrators need assistance in locating the toxic words within each post in order to decide whether it is toxic or non-toxic, instead of reading the whole post. SemEval-2021 Task 5 (Pavlopoulos et al., 2021) provides a valuable dataset, the Toxic Spans Detection dataset, for training models to detect toxic words in lengthy posts.
Based on the dataset from the shared task, we implement a machine learning model for detecting toxic words in posts. Our model includes the BiLSTM-CRF model (Lample et al., 2016) for detecting toxic spans in a post, and ToxicBERT (Hanu and Unitary team, 2020) for classifying whether a post is toxic or not. Before training the model, we pre-process the text of the posts and encode it with the GloVe word embedding (Pennington et al., 2014). Our model achieves an F1-score of 62.23% on the test set provided by the task organizers.

Related work
Many corpora have been constructed for toxic speech detection. They can be divided into flat-label and hierarchical-label datasets. Flat-label datasets assign a single label to each comment (e.g., hate, offensive, clean), while hierarchical datasets capture multiple aspects of a comment (e.g., hate based on racism, sexual orientation, religion, or disability). For the flat-label type, we mention several datasets, including the two English datasets provided by Waseem and Hovy (2016) and Davidson et al. (2017), the Arabic dataset of Albadi et al. (2018), and the Indonesian dataset of Alfina et al. (2017). For the hierarchical-label type, we mention the English dataset constructed by Zampieri et al. (2019), the Portuguese dataset provided by Fortuna et al. (2019), and the CONAN dataset by Chung et al. (2019), a multilingual corpus covering Italian, English, and French.
In addition, state-of-the-art approaches such as deep learning (Badjatiya et al., 2017) and transformer models (Isaksen and Gambäck, 2020) have been applied to hate speech detection and toxic post classification. However, these models classify only whole posts or documents. For the Toxic Spans Detection task, we therefore adapt mechanisms from sequence tagging and Named Entity Recognition (Lin et al., 2017) to detect toxic words within posts.

Dataset
The dataset is provided by SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021). It includes a training set and a test set. Both consist of two parts: the content of the posts and the spans denoting the toxic words in them. A span represents the toxic words of a post as a set of character indexes. Table 1 shows several examples from the training set; as the examples illustrate, a post may contain multiple spans of toxic words, and each span may cover a single word, a phrase, or a sentence. As described in Figure 1, most spans in the training set are single words, accounting for 67.65%, while only 20.06% of spans contain two words and 6.1% of spans are empty. Posts whose spans contain more than two words are rare; the training set even contains a post whose spans cover 25 words.
Figure 2 illustrates the number of toxic words in spans per post for the test set. Spans containing a single word account for the highest percentage (70.35%), higher than in the training set, while multi-word spans are few. The proportion of empty spans in the test set is also higher than in the training set, and the longest span in the test set contains only seven words.
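As an illustration of how these statistics can be derived from the character-offset format, the sketch below counts words per span; the file name and column names follow the official CSV layout of the task but should be treated as assumptions.

```python
# A minimal sketch of computing the words-per-span statistics above.
# Assumes a "spans" column holding a stringified list of character offsets
# and a "text" column; the file name is illustrative.
import ast
from collections import Counter

import pandas as pd

def words_per_span(offsets, text):
    """Count the words covered by each toxic span of a post.

    Consecutive character offsets belong to the same span; a gap larger
    than one character starts a new span.
    """
    if not offsets:
        return [0]  # empty span
    spans, current = [], [offsets[0]]
    for prev, cur in zip(offsets, offsets[1:]):
        if cur - prev > 1:
            spans.append(current)
            current = []
        current.append(cur)
    spans.append(current)
    return [len(text[s[0]:s[-1] + 1].split()) for s in spans]

train = pd.read_csv("tsd_train.csv")  # illustrative path
counts = Counter()
for offsets, text in zip(train["spans"].map(ast.literal_eval), train["text"]):
    counts.update(words_per_span(offsets, text))
print(counts.most_common())  # single-word spans dominate
```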

System description

Data preparation
Given the dataset from SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021), we first transform the spans into sets of words. Then, we pre-process the posts as follows: (1) segmenting the posts with the TweetTokenizer from nltk, and (2) lowercasing the text.
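A minimal sketch of these preparation steps is shown below; the spans_to_words helper is our own illustration of the span-to-word transformation, since the paper does not spell out the exact procedure.

```python
# Preprocessing sketch: TweetTokenizer segmentation plus lowercasing,
# and a helper that maps toxic character offsets to toxic words.
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def preprocess(text):
    """Segment a post with TweetTokenizer and lowercase the tokens."""
    return [token.lower() for token in tokenizer.tokenize(text)]

def spans_to_words(offsets, text):
    """Turn toxic character offsets into the set of toxic words."""
    toxic_chars = set(offsets)
    words, start = set(), 0
    for word in text.split():
        end = start + len(word)
        if toxic_chars & set(range(start, end)):
            words.add(word.lower())
        start = end + 1  # skip the space between words
    return words

post = "What a knucklehead. How can anyone not know this would be offensive??"
print(preprocess(post))
print(spans_to_words(list(range(7, 18)), post))  # {'knucklehead.'}
```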

Feature extraction
We use the glove.twitter.27B.25d word embedding to construct the dictionary and encode the text of the posts. Posts are encoded with the dictionary of the word embedding; the <UNK> token is used for words that are not found in the dictionary. To make all vectors the same length, we pad them with the <PAD> token and set the maximum vector length to 128. Spans are transformed into a binary vector over the words of the post, where toxic words are denoted as 1 and all other words as 0. Table 2 illustrates an example of encoding data in our system.
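The sketch below illustrates this encoding step, assuming the GloVe vectors are stored as a plain-text file; reserving ids 0 and 1 for <PAD> and <UNK> is our assumption.

```python
# Encoding sketch: build a vocabulary from GloVe, map tokens to ids,
# pad to a fixed length, and produce 0/1 labels for toxic words.
import numpy as np

MAX_LEN = 128
PAD, UNK = 0, 1  # assumed reserved ids

def load_vocab(path="glove.twitter.27B.25d.txt"):
    """Build a word -> id mapping and the embedding matrix from GloVe."""
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    vectors = [np.zeros(25), np.zeros(25)]  # placeholder vectors for PAD/UNK
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vocab[parts[0]] = len(vocab)
            vectors.append(np.asarray(parts[1:], dtype="float32"))
    return vocab, np.vstack(vectors)

def encode(tokens, toxic_words, vocab):
    """Encode tokens as ids and toxic words as a 0/1 label vector."""
    ids = [vocab.get(t, UNK) for t in tokens][:MAX_LEN]
    labels = [1 if t in toxic_words else 0 for t in tokens][:MAX_LEN]
    pad = MAX_LEN - len(ids)
    return ids + [PAD] * pad, labels + [0] * pad
```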

Training models
Detection model: BiLSTM-CRF is a deep neural network model used for Named Entity Recognition (Lample et al., 2016). We adapt this model to the task of detecting toxic words in documents. The model includes three main layers: (1) a word representation layer that uses the embedding matrix from the GloVe word embedding, (2) a BiLSTM layer for sequence labeling, and (3) a Conditional Random Field (CRF) layer that models the probability of the output label sequence. The output is a binary vector in which each value indicates whether the corresponding word in the post is toxic (1) or not (0). The architecture is shown in Figure 3.

Figure 3: The BiLSTM-CRF model architecture.
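A minimal sketch of this architecture is shown below, using PyTorch with the pytorch-crf package; the paper does not name its framework, so the library choice and the hidden size are assumptions.

```python
# BiLSTM-CRF sketch: GloVe embeddings -> BiLSTM -> linear emissions -> CRF.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, embeddings, hidden_size=128, num_tags=2):
        super().__init__()
        # (1) Word representation layer initialised from the GloVe matrix.
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(embeddings, dtype=torch.float32), freeze=False
        )
        # (2) BiLSTM layer for sequence labeling.
        self.lstm = nn.LSTM(
            embeddings.shape[1], hidden_size,
            batch_first=True, bidirectional=True,
        )
        self.hidden2tag = nn.Linear(2 * hidden_size, num_tags)
        # (3) CRF layer over the output label sequence.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, ids, tags, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(ids))[0])
        return -self.crf(emissions, tags, mask=mask)  # training loss (NLL)

    def predict(self, ids, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(ids))[0])
        return self.crf.decode(emissions, mask=mask)  # list of 0/1 tag lists
```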

Table 1: Examples from the training set.
No. 1. Post: "What a knucklehead. How can anyone not know this would be offensive??" Spans: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
No. 2. Post: "I only use the word haole when stupidity and arrogance is involved and not all the time. Excluding the POTUS of course." Spans: [] (empty)
Our system combines the detection model and the classification model. The detection model (BiLSTM-CRF) returns the toxic spans of a post, while the classification model (ToxicBERT, available at https://huggingface.co/unitary/toxic-bert) classifies whether the post is toxic or non-toxic. If a post is classified as non-toxic, the system returns an empty span; otherwise, it keeps the spans returned by the detection model. Finally, the predicted spans are decoded back to character indexes for submission. Our system is illustrated in Figure 4.
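The sketch below illustrates this combination logic, assuming ToxicBERT is loaded through the Hugging Face pipeline API; the decision threshold and the offset-decoding helper are our assumptions.

```python
# Combination sketch: ToxicBERT decides toxic vs. non-toxic; BiLSTM-CRF
# word tags are decoded back to character offsets only for toxic posts.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

def predict_spans(post, tokens, tags, threshold=0.5):
    """Return toxic character offsets, or [] if ToxicBERT says non-toxic."""
    if classifier(post)[0]["score"] < threshold:
        return []  # classification model overrides the detection model
    # Decode word-level tags back to character offsets for submission.
    offsets, cursor = [], 0
    for token, tag in zip(tokens, tags):
        start = post.lower().find(token, cursor)  # tokens were lowercased
        if start == -1:
            continue
        cursor = start + len(token)
        if tag == 1:
            offsets.extend(range(start, cursor))
    return offsets
```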

Figure 4: Our system architecture.

Evaluation metric
The variant of the F1-score proposed by Da San Martino et al. (2019) is used to evaluate the results of the competition. Let $T = \{t_1, t_2, \dots, t_n\}$ be the set of posts in the dataset, where $n$ is the number of posts; let $A$ be the spans given by the model and $G$ the ground-truth spans. The F1-score over the dataset is the average of the per-post scores:

$$F_1 = \frac{1}{n} \sum_{t \in T} \frac{2 \cdot P^t \cdot R^t}{P^t + R^t} \qquad (1)$$

In Equation 1, $P^t$ denotes the precision and $R^t$ the recall of post $t$, calculated as in Equations 2 and 3, respectively:

$$P^t = \frac{|S_{A_t} \cap S_{G_t}|}{|S_{A_t}|} \qquad (2)$$

$$R^t = \frac{|S_{A_t} \cap S_{G_t}|}{|S_{G_t}|} \qquad (3)$$

where $S_{A_t}$ and $S_{G_t}$ are the sets of toxic characters (the spans) of post $t$ according to the model and the ground truth, respectively.

Results

According to Table 3, when only the BiLSTM-CRF model is used, the F1-score is 61.32%. The result increases to 62.23% when the ToxicBERT classifier is applied; this is our final result in the shared task (ranked 63rd among 92 teams).
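For concreteness, the sketch below computes this character-offset F1; treating two empty spans as a perfect match follows the convention of the official evaluation and is noted here as an assumption.

```python
# Character-offset F1 sketch, following Equations 1-3.
def f1_per_post(pred, gold):
    """F1 between predicted and ground-truth sets of character offsets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both spans empty: perfect match by convention
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)  # Equation 2
    recall = len(pred & gold) / len(gold)     # Equation 3
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_dataset(predictions, ground_truths):
    """Average the per-post F1 over the whole dataset (Equation 1)."""
    scores = [f1_per_post(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)
```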

Error analysis
According to our error analysis, the model predicts single-word spans and empty spans well, while spans containing multiple words are often missed or only partially detected.

Conclusion
We use the BiLSTM-CRF and ToxicBERT models for detecting toxic words in posts. Our model achieves an F1-score of 62.23% in the competition. From the error analysis, we found that our model predicts well only on single-word spans and empty spans.
In future work, we will improve the performance of the detection model by applying the attention mechanism and by combining character-level with word-level representations. Character-level models such as CharBERT are a promising approach for increasing the performance of the toxic spans detection task.