UIT-E10dot3 at SemEval-2021 Task 5: Toxic Spans Detection with Named Entity Recognition and Question-Answering Approaches

The growth of toxic comments in online spaces has a serious effect on vulnerable users. Considerable effort has therefore gone into addressing the problem, and SemEval-2021 Task 5: Toxic Spans Detection is one such effort. The task asks participants to extract the toxic spans from given texts, and we analyze the structure of the dataset before running experiments. We approach the task in two ways, Named Entity Recognition with the spaCy library and Question-Answering with RoBERTa combined with ToxicBERT; the former achieves our highest F1-score of 66.99%.


Introduction
Social media is growing rapidly, and users can easily express their opinions and feelings about the topics they care about. However, with this freedom of expression, the amount of toxic content is increasing uncontrollably. Several studies have examined the effect of toxic speech on users' health; for example, research on the impact of toxic language on health was conducted in 2017 (Mohan et al., 2017). Toxic words can turn conversations into cyberbullying, cyber threats, or online harassment, all of which harm users. To reduce this harm, much research classifies content as toxic or non-toxic and then hides the whole text if it is toxic. However, hiding an entire text may inhibit freedom of speech, so censoring only the toxic spans is a better solution. SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021) aims to realize this idea.
Previous research on toxic content on the internet has focused on binary toxicity classification. In Task 5 of SemEval-2021, which is about toxic spans detection, we investigate toxicity in more depth by finding exactly which parts of a text are toxic. We propose two approaches to this problem: a NER approach and a Question-Answering (QA) approach. For the QA approach we use RoBERTa (Liu et al., 2019) combined with ToxicBERT (Hanu and Unitary team, 2020), both transfer learning models, and for the NER approach we use the spaCy library (Honnibal and Montani, 2017).
We organize the paper as follows. Section 2 reviews the related work that we consulted when building the systems. Section 3 describes the dataset and our analyses of it. Section 4 introduces our two proposed systems for toxic spans detection. Section 5 presents the experimental results and analyses. Finally, Section 6 concludes our work.

Related Works
Researchers around the world have recently started to concentrate on toxic speech, which inflicts individual and group harm and damages our social fabric (Tirrell, 2018). Several datasets have been built for classifying toxic speech on online forums, such as the dataset provided by Waseem and Hovy (2016) for English, the BEEP! dataset for Korean by Moon et al. (2020), the dataset for Russian provided by Smetanin (2020), the TolD-Br dataset for Brazilian Portuguese by Leite et al. (2020), and UIT-ViCTSD, a dataset for constructive and toxic speech detection in Vietnamese by Nguyen et al. (2021).
There are also shared tasks on toxic speech and hate speech, such as those from SemEval, including SemEval-2019 Task 5 Multilingual Detection of Hate (Basile et al., 2019), SemEval-2019 Task 6 Identifying and Categorizing Offensive Language in Social Media (OffensEval) (Zampieri et al., 2019), SemEval-2020 Task 12 Multilingual Offensive Language Identification in Social Media (Zampieri et al., 2020), and SemEval-2021 Task 5 Toxic Spans Detection (Pavlopoulos et al., 2021), which is the task we address in this paper.

Dataset
The SemEval-2021 Task 5 dataset originates from the publicly available Civil Comments dataset (Borkan et al., 2019), which consists of 1.2M posts and comments. This public dataset has no annotation of toxic spans but does have post-level toxicity annotations, which indicate whether a post as a whole is toxic. The task organizers retained 30K posts that were annotated as toxic or severely toxic by at least half of the crowd-raters in the annotations of Borkan et al.
The task organizers then randomly selected 10K of these 30K posts for toxic span annotation. They employed three experienced crowd-raters per post from a third-party crowd-annotation platform and warned them about adult content. However, the organizers also state that not all toxic posts are annotated with toxic spans.
For example, the post "Not if they shoot you first..." has an empty span annotation ([]).
The organizers provide participants with separate training and test sets. The training set has 7,939 records and the test set has 2,000 records. As noted in the annotation process, a single text may have multiple toxic spans highlighted. Figure 1 and Figure 2 illustrate the distribution of the number of spans per post in the training and test sets. Most posts contain a single span: nearly 68.8% of the training set and 70.8% of the test set. Posts with zero spans are also not rare, and their proportion is lower in the training set (6.15%) than in the test set (19.7%).
Moreover, we calculated the Jaccard score between the text and its spans in the given dataset for more in-depth analysis. The Jaccard score, also known as the Jaccard index or Jaccard similarity coefficient, was developed by Paul Jaccard (Jaccard, 1912) and is a statistic for measuring the similarity and diversity of sample sets. For two sets A and B it is defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The histogram in Figure 3 shows that most data points have Jaccard scores in the range 0 to 0.35, with the peak at 0 to 0.05, which means that the toxic character offsets are usually just a fraction of each post, even though some records have every character of the post annotated as toxic: 16 records in the test set and 212 records in the training set have Jaccard scores between 0.95 and 1.0. For that reason, only the toxic part(s) of a comment need to be censored rather than the whole comment as in the traditional method.
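The Jaccard analysis above can be illustrated with a minimal sketch. It assumes that the score is taken between the set of toxic character offsets and the set of all character offsets of the post; the function names and the sample text are hypothetical and not taken from our implementation.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def span_text_jaccard(toxic_offsets, text):
    """Jaccard score between the toxic offsets and all character offsets of the text."""
    return jaccard(toxic_offsets, range(len(text)))


text = "You are a fool"
offsets = [10, 11, 12, 13]  # the characters of "fool"
print(round(span_text_jaccard(offsets, text), 3))  # 4 of 14 characters -> 0.286
```

Because the toxic offsets are a subset of the text offsets, the score under this assumption reduces to the fraction of characters marked as toxic.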

Systems
In this paper, we propose two systems for the toxic spans detection task with NER and QA approaches. The first system is the QA approach based on RoBERTa and the second system is the NER approach based on spaCy's library.

Question-Answering Approach Based on RoBERTa
With the QA approach, we use RoBERTa combined with ToxicBERT as the basis for the system. RoBERTa (Liu et al., 2019) is a transfer learning model that came out of a replication study of BERT (Devlin et al., 2019). Unlike BERT, RoBERTa removes the Next Sentence Prediction (NSP) task from pre-training to improve training performance. ToxicBERT (Hanu and Unitary team, 2020) is also a transfer learning model, and it uses BERT as its main model for classifying toxicity.
ToxicBERT performs very well on the Jigsaw Unintended Bias in Toxicity Classification task on Kaggle, which uses the same dataset as SemEval-2021 Task 5, with a 93.64% F1-score. We use two models in our QA system, and Figure 4 gives an overview of the system with its training and testing phases. First, we preprocess the training set into the format required by the RoBERTa model, as shown in Figure 5. The model we use accepts only one span per example, but several training examples have more than one span, which we call "multi-span". Hence, we split each multi-span example (*) into single-span examples (**) (***), as below.
• Plain text: (*) This bitch is so fucking idiot.
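The multi-span splitting step can be sketched as follows. This is only an illustration under stated assumptions: a span is taken to be a run of consecutive character offsets, and the output field names (`context`, `answer`, `answer_start`) are hypothetical choices in the style of common QA datasets, not the exact format of Figure 5.

```python
def contiguous_runs(offsets):
    """Group character offsets into runs of consecutive integers."""
    runs = []
    for off in sorted(set(offsets)):
        if runs and off == runs[-1][-1] + 1:
            runs[-1].append(off)   # extends the current run
        else:
            runs.append([off])     # starts a new run
    return runs


def split_multi_span(text, offsets):
    """Turn one multi-span record into one single-span QA example per span."""
    examples = []
    for run in contiguous_runs(offsets):
        start, end = run[0], run[-1] + 1
        examples.append(
            {"context": text, "answer": text[start:end], "answer_start": start}
        )
    return examples


text = "You fool, what an idiot move"
offsets = list(range(4, 8)) + list(range(18, 23))  # "fool" and "idiot"
for example in split_multi_span(text, offsets):
    print(example["answer_start"], example["answer"])
```

Each resulting example keeps the full text as context, so the model still sees the other toxic words even though only one of them is the target answer.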
After splitting the texts, we tokenize the dataset with a subword model, Byte-Pair Encoding (BPE) (Sennrich et al., 2016). Then, we feed the data into a pre-trained RoBERTa model and fine-tune it. Based on our analysis of text lengths in the dataset, we set max_length=512 and epochs=5 for the model. After an extensive hyper-parameter search, we set learning_rate to 3e-5 and drop_out to 0.1. We also train the model with 5-fold cross-validation. After the training phase, the trained RoBERTa model is used for predicting new toxic spans.
In the testing phase, besides RoBERTa, we use another transfer learning model, ToxicBERT (Hanu and Unitary team, 2020), to identify toxic comments. With ToxicBERT, we classify the input text as toxic or non-toxic before predicting spans. If the result is non-toxic, we stop the prediction and return an empty span list. If it is toxic, we feed the text into the RoBERTa model to predict toxic words. After obtaining the spans, to check whether the text still contains toxic words, we remove the predicted toxic word(s) from the text being processed, recheck its toxicity with ToxicBERT, and re-predict its remaining toxic words (if any).
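The testing-phase loop described above can be sketched as below. The two callables are hypothetical stand-ins: `is_toxic` plays the role of the ToxicBERT classifier and `predict_span` the role of the fine-tuned RoBERTa model. The `max_rounds` cap and the toy lexicon-based stand-ins are assumptions added only to make the sketch self-contained.

```python
def predict_toxic_words(text, is_toxic, predict_span, max_rounds=10):
    """Extract toxic words until the classifier judges the remainder non-toxic.

    is_toxic(text) -> bool and predict_span(text) -> str are hypothetical
    stand-ins for the toxicity classifier and the QA span predictor.
    """
    found = []
    remaining = text
    for _ in range(max_rounds):  # assumed safety cap, not part of the paper
        if not is_toxic(remaining):
            break  # non-toxic: nothing (more) to extract
        answer = predict_span(remaining).strip()
        if not answer or answer not in remaining:
            break  # the predictor returned nothing usable
        found.append(answer)
        # Remove the predicted word, then recheck the rest in the next round.
        remaining = remaining.replace(answer, " ", 1)
    return found


# Toy stand-ins based on a tiny lexicon, for illustration only.
LEXICON = ("idiot", "stupid")


def toy_is_toxic(text):
    return any(word in text for word in LEXICON)


def toy_predict_span(text):
    return next((word for word in LEXICON if word in text), "")


print(predict_toxic_words("what a stupid idiot", toy_is_toxic, toy_predict_span))
# ['idiot', 'stupid']
```

Passing the models in as callables keeps the control flow independent of any particular model library.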
Because the final results are words, we convert them into character-offset spans, as the task requires.
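A minimal sketch of this conversion is given below. It assumes that every case-sensitive occurrence of a predicted word is marked as toxic; the paper does not specify how repeated or partial matches are handled, so this matching rule and the function name are assumptions.

```python
def words_to_offsets(text, words):
    """Return the sorted character offsets covered by occurrences of the given words."""
    offsets = set()
    for word in words:
        if not word:
            continue  # skip empty predictions
        start = text.find(word)
        while start != -1:
            offsets.update(range(start, start + len(word)))
            start = text.find(word, start + 1)
    return sorted(offsets)


print(words_to_offsets("you fool", ["fool"]))  # [4, 5, 6, 7]
```

Plain substring matching can also hit a word inside a longer word, so a stricter variant might match on token boundaries instead.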

NER Approach Based on spaCy's Library
In this approach, we tag all annotated character spans in the text as TOXIC entities to train the model, and we then predict all TOXIC entities in the texts of the test set.
To do this, we choose version 2.2.5 of spaCy's NER model (Honnibal and Montani, 2017) because its statistical system is efficient in both speed and accuracy for named-entity recognition. Apart from default entities such as location, person, and organization, spaCy also allows training the model on new entity types by updating it with new examples. Figure 7 shows the process of our spaCy-based system. Both the training and test sets are tokenized before being fed into the spaCy NER model for training or for predicting TOXIC entities. In the training phase, the input data must be in the format required by the spaCy NER model, as shown in Figure 8.
SpaCy has not published the architecture of its models yet, but it does give a brief explanation of how the models work, especially the NER model, through a four-step formula: embed, encode, attend, and predict. As in Figure 9, spaCy's model is fed with unique numerical values (IDs), each of which refers to a token of the corpus or a class of the NLP task (a named entity class). In the embed stage, word similarities are captured by hashing word features such as the lowercase form, the prefix, the suffix, and the shape. The encode stage takes the sequence of word vectors from the previous stage and computes a representation called the sentence matrix. Each row of the sentence matrix represents the meaning of a token in the context of its neighboring tokens, and this is done with a bidirectional RNN (Schuster and Paliwal, 1997). The output matrix of the second stage is passed to the attention layer of the CNN after being summarized with a query vector. Finally, a softmax function is used to predict the toxic class. After training, the CNN model is used for the NER task to extract the toxic class.
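The training format mentioned above can be sketched as follows. It assumes the tuple format documented for spaCy v2 training data, `(text, {"entities": [(start, end, label), ...]})` with an exclusive end offset, which is presumably what Figure 8 depicts; the helper name is hypothetical.

```python
def to_spacy_example(text, offsets, label="TOXIC"):
    """Build a spaCy v2-style training tuple from toxic character offsets."""
    entities = []
    start = prev = None
    for off in sorted(set(offsets)):
        if start is None:
            start = prev = off          # first offset opens a span
        elif off == prev + 1:
            prev = off                  # consecutive offset extends the span
        else:
            entities.append((start, prev + 1, label))  # close span, end exclusive
            start = prev = off
    if start is not None:
        entities.append((start, prev + 1, label))
    return (text, {"entities": entities})


print(to_spacy_example("you fool", [4, 5, 6, 7]))
# ('you fool', {'entities': [(4, 8, 'TOXIC')]})
```

Entity boundaries that do not line up with spaCy's token boundaries may need to be trimmed or snapped to tokens before training.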
The given toxic spans dataset is converted to a suitable format and fed into spaCy for training. During the contest, we used spaCy's small English model (en_core_web_sm) at version 2.2.5 and tried different parameters to find the best result. During training, the dataset is shuffled and passed through spaCy's training algorithm in batches, with batch sizes increasing from 4.0 to 32.0 with a step of 1.001. Moreover, the dropout rate is fixed at 0.5, and most of the experiments loop 45 times.
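The batching schedule can be illustrated with a plain-Python sketch. It assumes that the "step of 1.001" is a multiplicative compounding factor, in the spirit of spaCy's `compounding` and `minibatch` utilities; the two generators below are simplified re-implementations written for illustration, not spaCy's own code.

```python
import random
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatches(items, sizes):
    """Split items into consecutive batches; `sizes` must be an endless iterator."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


data = list(range(100))  # stand-in for the training examples
random.shuffle(data)     # the dataset is shuffled before batching
batches = list(minibatches(data, compounding(4.0, 32.0, 1.001)))
print(len(batches), len(batches[0]))  # 25 4
```

With a factor as small as 1.001 the batch size stays at 4 for the first couple of hundred batches and only slowly grows toward 32, so early updates are frequent and noisy while later ones are smoother.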

Experiments
After building the two systems, we evaluate them on the test set, and the following subsections discuss our results.

Evaluation Metrics
Before going through experimental results, we first discuss the evaluation metrics used in this SemEval-2021 Task 5.
In this task, all participating systems are evaluated by the F1 score (Da San Martino et al., 2019). Assume that system $S_i$ returns a set $C^t_{S_i}$ of toxic character offsets for post $t$, and let $G^t$ be the set of character offsets of the ground truth annotation of $t$. The F1 score of system $S_i$ with respect to the ground truth $G$ for post $t$ is computed as follows ($|\cdot|$ indicates set cardinality):
$$P^t(S_i, G) = \frac{|C^t_{S_i} \cap G^t|}{|C^t_{S_i}|}, \qquad R^t(S_i, G) = \frac{|C^t_{S_i} \cap G^t|}{|G^t|}$$
$$F_1^t(S_i, G) = \frac{2 \cdot P^t(S_i, G) \cdot R^t(S_i, G)}{P^t(S_i, G) + R^t(S_i, G)}$$
Finally, $F_1^t$ is averaged over all posts $t$ of the test set to get a single F1 score for system $S_i$.
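The metric can be sketched in a few lines of Python. The handling of empty sets (a score of 1 when both the prediction and the ground truth are empty, and 0 when only one of them is) follows the convention of the task description as we understand it and is an assumption here, as are the function names.

```python
def span_f1(predicted, gold):
    """Character-offset F1 of one post."""
    predicted, gold = set(predicted), set(gold)
    if not gold:
        # Assumed convention for posts without gold spans.
        return 1.0 if not predicted else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0  # also covers an empty prediction
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def system_f1(all_predicted, all_gold):
    """Average the per-post F1 over all posts of the test set."""
    scores = [span_f1(p, g) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores)


predictions = [[1, 2, 3], []]
gold = [[2, 3, 4], []]
print(round(system_f1(predictions, gold), 4))  # (2/3 + 1) / 2 -> 0.8333
```

Because the score is averaged per post, a post with an empty gold annotation contributes either 0 or 1, which makes the decision to predict nothing important on a test set with many zero-span posts.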

Experimental Results
Table 2 shows the results of our systems compared with those of other teams. In SemEval-2021 Task 5, our spaCy-based system ranked 34 out of 91 teams. The table reports our spaCy-based and RoBERTa-based systems alongside the teams ranked 1, 2, and 33 and the random baseline of the task. The F1-score of our best system is 66.99%, which is 3.84 percentage points lower than that of the first-ranked team and 49.09 percentage points higher than that of the baseline model.

Result Analyses
After analyzing our most effective system, the one based on spaCy, we identified important errors in both the predictions and the dataset by comparing predicted spans with gold spans. Several records in the given data stand alone without context, which makes them confusing or ambiguous. Moreover, some comments use slang or idioms, for which our system outputs no spans. We also noticed inconsistent annotations and highlighted non-toxic spans in the dataset. Likewise, several words in the texts are intentionally misspelled, which also impairs our system's performance. Examples of these errors are given in Table 3 in the Appendix.

Conclusion and Future Work
In this paper, we introduced two systems for toxic spans detection based on named entity recognition and question-answering approaches. We obtained our best result with the spaCy-based system, with an F1-score of 66.99%, and ranked 34 out of 91 teams in SemEval-2021 Task 5.
In the future, we plan to improve our systems by implementing various SOTA models for toxic spans detection. With such systems, we can help create friendlier online conversations and make social media forums safer for users.