CSECU-DSG at SemEval-2021 Task 5: Leveraging Ensemble of Sequence Tagging Models for Toxic Spans Detection

The upsurge of prolific blogging and microblogging platforms enabled the abusers to spread negativity and threats greater than ever. Detecting the toxic portions substantially aids to moderate or exclude the abusive parts for maintaining sound online platforms. This paper describes our participation in the SemEval 2021 toxic span detection task. The task requires detecting spans that convey toxic remarks from the given text. We explore an ensemble of sequence labeling models including the BiLSTM-CRF, spaCy NER model with custom toxic tags, and fine-tuned BERT model to identify the toxic spans. Finally, a majority voting ensemble method is used to determine the unified toxic spans. Experimental results depict the competitive performance of our model among the participants.


Introduction
Social media has become a key factor in world dynamics, and toxicity in user-generated content is a real threat. Threats and hatred instigated in posts and blogs implant fear in users' minds and prevent them from sharing anything from creative thoughts and valuable opinions to critical information. Sometimes this leads to severe mental trauma and even fatalities. Hence, precisely detecting toxicity in comments and posts is a formidable but necessary task, enabling moderation of those portions and providing users a safe online platform to express themselves.
Toxic span detection is the task of detecting the specific toxic segments of a text rather than labeling the whole text as toxic. The goal is to eliminate the vagueness of simple toxic text classification and help moderators act precisely on the toxic portions instead of the whole post. To elucidate the task, two examples are presented in Table 1.
The first four authors have equal contributions.
Toxic content detection on online platforms is an active research area. Numerous works have addressed the binary and multi-label classification of toxic texts. For instance, Georgakopoulos et al. (Georgakopoulos et al., 2018) investigated the impact of CNNs on toxic comment classification against traditional bag-of-words approaches. A multiple word embedding-based approach was adopted by Carta et al. (Carta et al., 2019) for multi-class multi-label toxic comment classification. Besides, the effectiveness of feature extraction in hate speech detection was explored by Schmidt et al. (Schmidt and Wiegand, 2017). A multitude of datasets on toxic comments was also introduced, such as a dataset based on Wikipedia discussion comments (Wulczyn et al., 2017), comments on online forums (Borkan et al., 2019a), and the offensive language identification dataset (OLID) (Zampieri et al., 2019).
However, very few works detect the precise toxic span from text contents. Katsiolis (Katsiolis, 2020) explored both supervised and unsupervised methods to address this challenge. The unsupervised methods include the input erasure method and the LIME algorithm, whereas the supervised method implements sequence labeling through a BERT model. The unintended bias created in publicly used toxicity detection models due to factors such as the influence of regional culture was investigated by Borkan et al. (Borkan et al., 2019a). Pavlopoulos et al. (Pavlopoulos et al., 2017) surveyed the impact of user embeddings, user type embeddings, user biases, or user type biases on the RNN-based moderation method.
In this paper, we portray our insights acquired from experimenting on this task. We propose an approach focusing on an ensemble of sequence labeling models including the BiLSTM-CRF, spaCy NER model with custom toxic tags, and fine-tuned BERT model. We procure the spans from these models through a majority voting scheme to determine the final toxic spans.
The organization of this paper is as follows: we elucidate our proposed framework in Section 2. Section 3 encompasses the experimental details and comparative performance analysis. Finally, we conclude this paper with some future directions in Section 5.

Proposed Framework
We cast toxic span detection as a sequence tagging task and employ an ensemble of sequence tagging models. Our proposed system comprises three individual models. The framework of our system is depicted in Figure 1. The first model is a BiLSTM-CRF model with the BIO tagging scheme. The second model is a custom spaCy named entity recognition (NER) model. The third model is a fine-tuned BERT model for token classification. These models generate token-level tags for a text. Subsequently, we extract spans based on the toxic tags. Finally, we apply a majority voting based fusion scheme on these spans to determine the final toxic spans.

BiLSTM-CRF
The BiLSTM-CRF model is well-known for sequence-tagging tasks such as named entity recognition (NER). We utilize the model implemented by (Reimers and Gurevych, 2017). For training purposes, the dataset needs to be in CoNLL-2003 format, which requires two columns for tokens and BIO tags. Since this requires the text to be in a tokenized form, we tokenize the text using the NLTK TweetTokenizer (Bird et al., 2009). After tokenization, we label the tokens with custom tags, namely B-TOX (begin), I-TOX (inside), and O (outside), utilizing the toxic spans from the training dataset. These tokens are then sent to the embedding layer. The embedding layer has three variants of embeddings: word embedding, a casing (capitalization) feature, and character embedding. We employ pre-trained GloVe (6B) (Pennington et al., 2014) word embeddings and CNN-based character embeddings (Ma and Hovy, 2016). The embedding vectors are concatenated, and the output is fed to the BiLSTM encoder, which tags tokens with the BIO tagging scheme. The BiLSTM encoder is followed by a CRF classifier, which optimizes the tag sequence by enforcing valid transitions between tags.
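The conversion from character-level toxic spans to BIO tags can be sketched as follows. This is an illustrative simplification: the paper uses the NLTK TweetTokenizer, while the sketch below uses a simple regex tokenizer to stay self-contained; the helper name `bio_tags` is hypothetical.

```python
# Sketch: convert character-level toxic offsets into BIO tags.
# (Illustrative; the paper tokenizes with NLTK's TweetTokenizer,
# here a regex tokenizer keeps the example self-contained.)
import re

def bio_tags(text, toxic_offsets):
    """toxic_offsets: set of character indices marked as toxic."""
    tags = []
    prev_toxic = False
    for m in re.finditer(r"\S+", text):
        # a token is toxic if any of its characters fall in the span
        is_toxic = any(i in toxic_offsets for i in range(m.start(), m.end()))
        if not is_toxic:
            tags.append((m.group(), "O"))
            prev_toxic = False
        else:
            tags.append((m.group(), "I-TOX" if prev_toxic else "B-TOX"))
            prev_toxic = True
    return tags

text = "You are a stupid idiot"
offsets = set(range(10, 22))  # characters of "stupid idiot"
print(bio_tags(text, offsets))
# -> [('You', 'O'), ('are', 'O'), ('a', 'O'), ('stupid', 'B-TOX'), ('idiot', 'I-TOX')]
```

The token/tag pairs produced this way correspond directly to the two-column CoNLL-2003 format the model expects.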

Custom spaCy NER
We exploit spaCy (Honnibal and Montani, 2017) to build an NER-type sequence labeling model with the custom tag "TOXIC". We convert the dataset to spaCy's entity format and load a blank spaCy English model. We append new word vectors utilizing a pre-trained word2vec (Mikolov et al., 2013) model. We then add an NER pipeline to the model along with the "TOXIC" label. We disable all pipelines except NER and loop through the training dataset several times.
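The conversion to spaCy's entity format groups contiguous toxic character offsets into (start, end, label) triples. A minimal sketch, with the hypothetical helper `to_spacy_example`:

```python
# Sketch: convert a list of toxic character offsets into a
# spaCy-style training example (text, {"entities": [(start, end, "TOXIC")]}).
def to_spacy_example(text, toxic_offsets):
    entities = []
    offsets = sorted(toxic_offsets)
    start = None
    for i, idx in enumerate(offsets):
        if start is None:
            start = idx
        # close the entity when the run of contiguous offsets ends
        if i + 1 == len(offsets) or offsets[i + 1] != idx + 1:
            entities.append((start, idx + 1, "TOXIC"))
            start = None
    return (text, {"entities": entities})

example = to_spacy_example("You stupid fool", list(range(4, 15)))
print(example)
# -> ('You stupid fool', {'entities': [(4, 15, 'TOXIC')]})
```

Examples in this format can then be fed to spaCy's NER training loop after the "TOXIC" label is registered.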

Fine-tuned BERT
We fine-tune the state-of-the-art bert-large-cased model (Devlin et al., 2019) to identify the toxic spans. We employ the BertForTokenClassification (Wolf et al., 2019) class to perform token-level tagging, which assigns a label to each tokenized word in a sentence. To generate the training data, we convert each sentence into tokens and annotate them using the character spans: tokens that fall within a toxic span are tagged as "toxic" and all others as "non-toxic".
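The binary token labeling can be sketched as below, assuming each token comes with its (start, end) character offsets (as produced, e.g., by a tokenizer's offset mapping); the helper `token_labels` and the offsets in the example are hypothetical.

```python
# Sketch: binary "toxic"/"non-toxic" labels for token classification.
# token_offsets: list of (start, end) character offsets per token.
# toxic_offsets: set of toxic character positions.
def token_labels(token_offsets, toxic_offsets):
    labels = []
    for start, end in token_offsets:
        # a token is toxic if it overlaps the annotated span
        inside = any(i in toxic_offsets for i in range(start, end))
        labels.append("toxic" if inside else "non-toxic")
    return labels

# tokens of "You are an idiot" with their character offsets
tokens = [(0, 3), (4, 7), (8, 10), (11, 16)]
print(token_labels(tokens, set(range(11, 16))))
# -> ['non-toxic', 'non-toxic', 'non-toxic', 'toxic']
```

These labels serve as the per-token targets when fine-tuning BertForTokenClassification.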

Fusion of Models
An ensemble approach constructs multiple models and blends their outputs to bring out improved results. To obtain a more accurate solution than any single model, we apply majority voting (Rokach, 2010) on the spans generated from the three models, as shown in Figure 1. The primary idea is based on the frequency of the span elements: if a span element is predicted by at least two models, it is included in the final predicted span. Thus, we obtain our final toxic spans through majority voting.
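Since the task represents spans as sets of character offsets, the voting reduces to counting per-offset votes. A minimal sketch (the function name `majority_vote` is hypothetical):

```python
# Sketch of the majority-voting fusion: a character offset is kept
# in the final span if at least two of the three models predict it.
from collections import Counter

def majority_vote(model_spans, min_votes=2):
    """model_spans: per-model predictions, each a set of toxic
    character offsets; returns the fused offsets, sorted."""
    counts = Counter()
    for spans in model_spans:
        counts.update(spans)
    return sorted(i for i, c in counts.items() if c >= min_votes)

preds = [{3, 4, 5, 6}, {4, 5, 6, 7}, {5, 6}]
print(majority_vote(preds))
# -> [4, 5, 6]
```

Offsets predicted by only one model (here 3 and 7) are dropped, which filters out each model's idiosyncratic false positives.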

Dataset Description
For detecting toxic spans in posts, we used the Civil Comments Dataset (Borkan et al., 2019b), which consists of 10K toxic comments. The whole dataset is divided into three subsets, where the train, trial, and test sets comprise 7939, 690, and 2000 comments, respectively. Toxic comments fall into two groups: (1) those with no toxic spans and (2) those with toxic spans identified by specific character positions. Analyzing the ratio of empty and toxic spans in the dataset, we found that 90% of the comments contain toxic spans while only 10% have empty spans. F1-score is used as the primary evaluation metric in this task.
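The evaluation computes F1 per post over the predicted and gold sets of toxic character offsets. A sketch, assuming (as in the task setup) that an empty prediction matching an empty gold span scores 1:

```python
# Sketch of the per-post character-offset F1 used as the task metric
# (assumes F1 = 1 when both gold and predicted spans are empty).
def span_f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(span_f1({4, 5, 6}, {5, 6, 7}), 3))
# -> 0.667
```

The system score is the mean of this per-post F1 over the test set.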

Experimental Setup
In our CSECU-DSG system submitted to SemEval-2021 Task 5 (Pavlopoulos et al., 2021), we make use of three sequence and entity tagging models to obtain better predictions. We present the configuration of our best submitted system in Table 2. Majority voting is then applied to the spans predicted by these models.

Results Analysis
Now, we compare the performance of our system against other competitors' systems. Among the 91 valid submissions, the comparative performance with the top-performing teams is depicted in Table 3. It shows that our system achieved competitive performance compared to the participants' systems, trailing the top-performing team, HITSZ-HLT, by only 3%.

Discussion
To estimate the impact of individual components on the overall system's performance, we examine the performance of the individual models on the test set. The findings are presented in Table 4. It shows that all three models obtained similar performance. However, applying the majority voting based scheme to these three models improves the overall result by almost 3%, which leads to better detection of toxic spans from the text. This demonstrates the efficacy of the ensemble strategy in ameliorating the performance.
To qualitatively demonstrate the effectiveness of the ensemble approach compared to the individual models, an instance is illustrated in Table 5. It clearly shows that majority voting helps to detect the accurate span.

Conclusion and Future Directions
In this paper, we introduced an ensemble of three distinct models to detect toxic spans. Among these models, the BiLSTM-CRF and custom spaCy NER models are implemented as NER-type sequence and entity tagging models, whereas the fine-tuned BERT model is exploited as a token classification model. We also leveraged a majority voting strategy to overcome the limitations of the individual models. Our system tackles the task challenge effectively and achieved competitive performance compared to the participants' systems.