HITSZ-HLT at SemEval-2021 Task 5: Ensemble Sequence Labeling and Span Boundary Detection for Toxic Span Detection

This paper presents the winning system of SemEval-2021 Task 5: Toxic Spans Detection. The task aims to locate the spans within a text that contribute to its toxicity, which is crucial for semi-automated moderation of online discussions. We formalize the task separately as a Sequence Labeling (SL) problem and as a Span Boundary Detection (SBD) problem, and employ three state-of-the-art models. We then integrate the predictions of these models to produce a more credible and complementary result. Our system achieves a char-level score of 70.83%, ranking 1/91. In addition, we explore a lexicon-based method, which is strongly interpretable and flexible in practice.


Introduction
In 2020, 41% of American adults reported having experienced some form of online harassment¹. Increasing incidents of online harassment and cyber violence have spurred researchers to investigate the problem of identifying and filtering offensive speech on the Internet. Most previously published insult detection tasks (Davidson et al., 2017; Xu et al., 2012) and methods (Aroyehun and Gelbukh, 2018; Modha et al., 2018) classify an entire comment (or document) as offensive or not, but cannot identify the specific toxic parts of the comment. Unlike previous studies, SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021) requires the identification of the specific toxic spans, which is more innovative and challenging, and is a key step towards successful semi-automatic review of comments.

† Authors equally contributed to this work.
‡ Corresponding Author: xuruifeng@hit.edu.cn
¹ https://www.pewresearch.org/internet/2021/01/13/the-state-of-online-harassment/

More formally, toxic span detection is an extraction task, which is usually formalized as a Sequence Labeling (SL) problem, as shown in Figure 1(a), locating spans with BIO tags. However, SL methods suffer from a huge search space due to the compositionality of labels (the power set of all sentence words), as discussed by Lee et al. (2016) and Hu et al. (2019a). Therefore, in addition to the SL formalization, we also formalize the task as a Span Boundary Detection (SBD) problem, as shown in Figure 1(b), locating spans by their start and end positions. Note that when a sentence contains multiple spans, the matching of start and end positions may be ambiguous during decoding. Thus, in theory, the SBD formalization is not consistently superior to the SL formalization. Hence, we combine the predictions of the two formalizations to produce a more credible and complementary result. Our system achieves a char-level score of 70.83%, ranking 1/91.
We also explore lexicon-based methods, which usually have high precision but rather low recall, and which are strongly interpretable and flexible in practice. First, we mine a toxic lexicon from the training set using a simple statistical strategy. Next, WordNet (Fellbaum, 2010) and GloVe (Pennington et al., 2014) are used to extend this lexicon. With the toxic lexicon, we extract toxic spans through word-level matching.

Related Work
In recent years, cyber violence has become a widespread societal concern, and identifying and filtering hate speech has become an important topic in machine learning. TRAC proposes an aggression recognition task (Kumar et al., 2018) that provides a dataset of 15,000 annotated Facebook posts and comments in English and Hindi for training and validation. The task aims to classify comments into three categories: non-aggressive, covertly aggressive, and overtly aggressive. The Toxic Comment Classification Challenge² is an open competition on Kaggle that provides participants with comments from Wikipedia and defines six toxic categories: toxic, severe toxic, obscene, threat, insult, and identity hate. SemEval-2019 Task 6 (Zampieri et al., 2019) covers not only whether a comment is offensive but also the type and the target of the attack. Building on this, SemEval-2020 Task 12 (Zampieri et al., 2020) extends the dataset to five languages: Arabic, Danish, English, Greek, and Turkish.

Methods
In this section, we describe in detail how toxic span detection is formalized and the corresponding solutions.

Sequence Labeling
The BIO tagging scheme is used to locate toxic spans, where B (Begin) marks the first token of a toxic span, I (Inside) marks the inside and end tokens of a toxic span, and O marks non-toxic tokens. Following most existing work (Lample et al., 2016; Ma and Hovy, 2016), we leverage Conditional Random Fields (CRF) (Lafferty et al., 2001) for learning and inference. In addition to token-level classification, the CRF models the dependencies between tags in a tag sequence through the transition matrix A ∈ R^{K×K}, where K is the size of the tag space, i.e., K = 3. For the contextual representation x ∈ R^{n×h}, the score of a tag sequence y = (y_1, …, y_n) in the CRF is defined as:

$$\mathrm{score}(x, y) = \sum_{k=1}^{n} h_k(y_k; x) + \sum_{k=1}^{n-1} A_{y_k, y_{k+1}}, \tag{1}$$

where h_k(y_k; x) is the score of the tag y_k at time step k. Then the conditional probability is obtained by a normalization operation:

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y} \exp(\mathrm{score}(x, y'))}, \tag{2}$$

where Y contains all possible tag sequences. During inference, the predicted tag sequence ŷ is obtained by:

$$\hat{y} = \arg\max_{y' \in Y} p(y' \mid x). \tag{3}$$

We adopt BERT (Devlin et al., 2019) and BERT+LSTM (Hochreiter and Schmidhuber, 1997) as the language encoder, resulting in two solutions: BERT+CRF and BERT+LSTM+CRF. We add the LSTM because we believe that the contextual representation it refines could be more sensitive to the position of tokens.

² https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
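The CRF scoring and inference described above can be illustrated with a small, dependency-free sketch. The text does not name the inference algorithm; Viterbi decoding, the standard choice for a linear-chain CRF, is assumed here. The function names, the tag order, and the toy emission and transition scores are hypothetical and only serve the illustration.

```python
from typing import List

TAGS = ["B", "I", "O"]  # K = 3 tags; this index order is an assumption


def sequence_score(emissions: List[List[float]],
                   transitions: List[List[float]],
                   tags: List[int]) -> float:
    """Emission scores h_k(y_k; x) plus transition scores A[y_k][y_{k+1}]."""
    score = sum(emissions[k][t] for k, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence by dynamic programming."""
    n, num_tags = len(emissions), len(transitions)
    best = list(emissions[0])  # best score of any path ending in each tag
    backptr = []
    for k in range(1, n):
        new_best, ptr = [], []
        for cur in range(num_tags):
            prev = max(range(num_tags),
                       key=lambda p: best[p] + transitions[p][cur])
            ptr.append(prev)
            new_best.append(best[prev] + transitions[prev][cur]
                            + emissions[k][cur])
        best = new_best
        backptr.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda t: best[t])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy scores for a 3-token input; columns follow TAGS (B, I, O).
emissions = [[0.1, 0.0, 2.0], [3.0, 0.2, 0.1], [0.0, 2.5, 0.3]]
transitions = [[0.0] * 3 for _ in range(3)]
transitions[2][1] = -1e4  # discourage the invalid transition O -> I
path = viterbi_decode(emissions, transitions)
print([TAGS[t] for t in path])  # ['O', 'B', 'I']
```

In a trained model the emission scores would come from the encoder (BERT or BERT+LSTM) and the transition matrix would be learned jointly; here both are fixed by hand.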

Span Boundary Detection
Unlike the SL formalization, the SBD formalization uses a start and end position tagging scheme to represent toxic spans. The SBD formalization was originally applied to the machine reading comprehension task (Seo et al., 2016; Wang and Jiang, 2016). In these works, two n-class classifiers are employed to predict the start position and the end position separately, where n denotes the length of the input sentence. However, this strategy can only output a single span for an input sentence. Later, Hu et al. (2019b) extended the two n-class classifiers strategy with a heuristic multi-span decoding algorithm. But this is not a concise and efficient solution for the multi-span scenario, as the decoding algorithm relies on two hyper-parameters: (1) γ, the minimum score threshold, and (2) K, the maximum number of spans. A more recent and popular alternative is to employ two binary classifiers that determine, for each token, whether it is a start (end) position or not (Li et al., 2020; Wei et al., 2020; Yu et al., 2019). In this paper, we adopt the binary classifiers strategy for the SBD formalization and describe the details below. Given the contextual representation x = {x_1, x_2, …, x_n} ∈ R^{n×h}, for each position i, we compute the probability that it is a start position by Equation (4) and the probability that it is an end position by Equation (5).
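$$p_{\text{start}}(i) = \sigma(x_i W_1 + b_1), \tag{4}$$

$$p_{\text{end}}(i) = \sigma([x_i; p_{\text{start}}(i)] W_2 + b_2), \tag{5}$$

where W_1 ∈ R^{h×1}, W_2 ∈ R^{(h+1)×1}, and b_1, b_2 ∈ R are model parameters, σ is the sigmoid function, and [·;·] denotes concatenation. The predictions of start and end positions are obtained by:

$$\text{starts} = \{\, i \mid p_{\text{start}}(i) > 0.5,\ i = 1, \dots, n \,\}, \tag{6}$$

$$\text{ends} = \{\, i \mid p_{\text{end}}(i) > 0.5,\ i = 1, \dots, n \,\}. \tag{7}$$

Then we adopt the nearest start-end matching strategy: for each predicted start position s ∈ starts, the nearest predicted end position e to the right of s is selected to form a predicted span (s, e).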
As with the SL models, we adopt BERT as the language encoder, and we call this model BERT+Span.
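The thresholding and nearest start-end matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 0-based (the text uses 1-based indices), an end position at the same token as the start is allowed so that single-token spans can be decoded, and a start with no end to its right is dropped. The function name `decode_spans` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def decode_spans(p_start: List[float], p_end: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, p in enumerate(p_start) if p > threshold]
    ends = [i for i, p in enumerate(p_end) if p > threshold]  # sorted by construction
    spans = []
    for s in starts:
        j = bisect_left(ends, s)  # index of the first end position >= s
        if j < len(ends):         # assumption: unmatched starts are discarded
            spans.append((s, ends[j]))
    return spans


# Toy per-token probabilities for a 5-token sentence.
p_start = [0.9, 0.1, 0.2, 0.8, 0.1]
p_end = [0.1, 0.7, 0.1, 0.9, 0.2]
print(decode_spans(p_start, p_end))  # [(0, 1), (3, 3)]
```

Because each start is matched independently, two starts can share the same end when no end lies between them, which is one source of the decoding ambiguity mentioned in the Introduction.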

Ensemble Strategy
A voting method is applied to integrate the predictions. Specifically, given k different models, a character is retained if at least k/2 of the models predict that it belongs to a toxic span.
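The voting rule above can be sketched at the character level. The sketch assumes that each model's prediction has already been converted to a collection of toxic character offsets; the function name `vote` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def vote(predictions: List[Iterable[int]]) -> List[int]:
    """Keep a character offset if at least k/2 of the k models predict it."""
    k = len(predictions)
    counts = Counter(offset for pred in predictions for offset in set(pred))
    # 2 * c >= k is the integer form of c >= k / 2.
    return sorted(offset for offset, c in counts.items() if 2 * c >= k)


# Toy predictions from three models (character offsets).
preds = [{0, 1, 2, 3}, {0, 1, 2}, {2, 3, 10}]
print(vote(preds))  # [0, 1, 2, 3]
```

With k = 3 a character needs two votes. For an even k the rule as stated keeps ties, so with k = 2 it reduces to the union of the two predictions.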

Data
The given trial data and training data are merged, and duplicates are removed. In addition, we fix some annotation errors, such as partially labeled words. 80% of the processed data is used for training, and the rest serves as the validation set. Table 1 shows the statistics of the data used.
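The merge, deduplication, and 80/20 split can be sketched as below. The text does not say how duplicates are identified or whether the data is shuffled before splitting; this sketch assumes exact duplicates of a (text, offsets) pair and a seeded shuffle. The function name and the seed are hypothetical.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, Tuple[int, ...]]  # (text, toxic character offsets)


def merge_dedup_split(trial: List[Sample], train: List[Sample],
                      train_ratio: float = 0.8,
                      seed: int = 13) -> Tuple[List[Sample], List[Sample]]:
    """Merge two sample lists, drop exact duplicates, and split train/validation."""
    merged = list(dict.fromkeys(trial + train))  # order-preserving deduplication
    random.Random(seed).shuffle(merged)          # assumption: shuffle before splitting
    cut = int(len(merged) * train_ratio)
    return merged[:cut], merged[cut:]


trial = [("a b", (0,)), ("c d", ())]
train = [("a b", (0,)), ("e f", ()), ("g h", ()), ("i j", ())]
train_set, dev_set = merge_dedup_split(trial, train)
print(len(train_set), len(dev_set))  # 4 1
```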

Parameter Settings
We find that the parameter size of the pre-trained model does not have a significant effect on performance, and therefore we simply adopt BERT-base as our language encoder, which consists of 12 transformer blocks with 12 attention heads. The three models are trained separately. The learning rate of BERT is set to 2e-5, the learning rate of the CRF is set to 5e-3, and the maximum encoding length is 128. The weight decay is set to 0.01.

Evaluation Metrics
We use the official metric, i.e., the char-level F1-score, as the evaluation metric. In addition, for a more detailed analysis, we also report character-level Precision (P) and Recall (R). Note that F1, P, and R are each averaged over the samples, so the identity F1 = 2PR/(P + R) does not hold for the reported values. Table 2 shows the performance of the three benchmark models and the ensemble approach. The experimental results show that all three models achieve similar F1-scores, and integrating them yields an improvement of more than 1%, indicating that the predictions of the three models are complementary.
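The sample-averaged metric described above, and the reason the averaged values do not satisfy F1 = 2PR/(P + R), can be illustrated with a short sketch. The handling of empty predictions or empty gold spans is not described in the text; the convention used here (score 1 when both are empty, 0 when exactly one is empty) is an assumption, as are the function names.

```python
from typing import Iterable, List, Tuple


def sample_prf(pred: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Character-level precision, recall, and F1 for one sample."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:   # assumption: nothing to find, nothing predicted
        return 1.0, 1.0, 1.0
    if not pred or not gold:    # assumption: exactly one side is empty
        return 0.0, 0.0, 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    f1 = 2 * p * r / (p + r) if tp else 0.0
    return p, r, f1


def corpus_prf(preds: List[Iterable[int]],
               golds: List[Iterable[int]]) -> Tuple[float, ...]:
    """Average each of P, R, and F1 over the samples."""
    scores = [sample_prf(p, g) for p, g in zip(preds, golds)]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


preds = [[0, 1, 2, 3], [5]]
golds = [[0, 1], [5, 6, 7, 8]]
P, R, F1 = corpus_prf(preds, golds)
print(round(P, 4), round(R, 4), round(F1, 4))  # 0.75 0.625 0.5333
print(round(2 * P * R / (P + R), 4))           # 0.6818, which differs from the averaged F1
```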

Ensemble Approach
To further analyze the differences and respective advantages of the SL and SBD formalizations, we compare their performance in the single-span and multi-span scenarios in Figure 2. We find that the SBD formalization is more advantageous in the single-span scenario, while the SL formalization is more advantageous in the multi-span scenario, which is consistent with our claim.

Lexicon-based Approach
We also explore a lexicon-based approach for predicting toxic spans. A toxic lexicon is mined from the training data using a simple statistical strategy. More specifically, the toxic score of a word w is defined as:

$$\mathrm{toxic\_score}(w) = \frac{\#\, w \text{ in toxic span}}{\#\, w \text{ in whole corpus}},$$

where #w in toxic span is the number of occurrences of word w in toxic spans, and #w in whole corpus is the number of occurrences of word w in the whole corpus. The words with a toxic score greater than a given threshold θ are then selected to form a lexicon. At prediction time, the words in the sentence that appear in the toxic lexicon are extracted as the predicted toxic spans. There are three lexicons in our experiment: two were collected by Wiegand et al. (2018), and the third was mined by ourselves from the training set. Table 3 shows the results of the lexicon-based approaches and the ensemble approach, and we can observe that our lexicon-based approaches obtain notable F1-scores. In addition, we calculate the average precision and average recall of the different methods on the test set. Our original lexicon-based approach even outperforms the ensemble approaches in average precision, but there is still a significant gap in average recall. Since the lexicon-based approaches can only identify the toxic words in the lexicon, the recall can be improved by expanding the toxic lexicon.
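The lexicon mining and word-level matching steps described above can be sketched as follows. Several details are assumptions not stated in the text: words are found with a simple `\w+` regular expression and lowercased, a word counts as "in a toxic span" only if all of its characters are annotated as toxic, and training samples are (text, toxic character offsets) pairs. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

WORD = re.compile(r"\w+")  # assumption: simple regex tokenization


def mine_lexicon(samples: Iterable[Tuple[str, Iterable[int]]],
                 theta: float = 0.5) -> Set[str]:
    """Select words whose toxic score (toxic count / total count) exceeds theta."""
    toxic, total = Counter(), Counter()
    for text, offsets in samples:
        offsets = set(offsets)
        for m in WORD.finditer(text):
            word = m.group().lower()
            total[word] += 1
            # Assumption: the whole word must lie inside the annotated span.
            if all(i in offsets for i in range(m.start(), m.end())):
                toxic[word] += 1
    return {w for w in total if toxic[w] / total[w] > theta}


def predict_offsets(text: str, lexicon: Set[str]) -> List[int]:
    """Return the character offsets of every word that appears in the lexicon."""
    return [i for m in WORD.finditer(text) if m.group().lower() in lexicon
            for i in range(m.start(), m.end())]


samples = [
    ("you are an idiot", range(11, 16)),
    ("what an idiot move", range(8, 13)),
    ("nice move", []),
]
lexicon = mine_lexicon(samples, theta=0.5)
print(lexicon)                                    # {'idiot'}
print(predict_offsets("such an idiot", lexicon))  # [8, 9, 10, 11, 12]
```

Every prediction can be traced back to a lexicon entry, which is the interpretability the approach is valued for. Words absent from the lexicon are never predicted, which is the source of the low recall discussed above.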
To improve the recall, we use WordNet (Miller, 1995) and GloVe (Pennington et al., 2014) to expand the toxic lexicon. In detail, we collect the synsets of each toxic word from WordNet, and collect the most similar words by computing the cosine similarity of GloVe vectors. The performance of the two expanded approaches is shown in Table 3. Although the recall of both approaches improves over the original lexicon, the precision decreases significantly, indicating that there are a considerable number of non-toxic words in the synonyms found through WordNet.
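The GloVe-based expansion can be sketched with cosine similarity over word vectors. The sketch uses tiny hand-made two-dimensional vectors in place of real GloVe embeddings, and the parameters `top_n` and `min_sim` are hypothetical, since the text does not say how many neighbors are collected or whether a similarity cutoff is applied. The WordNet-based expansion is not shown.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(lexicon: Set[str], vectors: Dict[str, List[float]],
                   top_n: int = 1, min_sim: float = 0.0) -> Set[str]:
    """Add the top_n nearest neighbors of each lexicon word to the lexicon."""
    expanded = set(lexicon)
    for word in lexicon:
        if word not in vectors:
            continue
        scored = sorted(((cosine(vectors[word], vec), cand)
                         for cand, vec in vectors.items() if cand != word),
                        reverse=True)
        expanded.update(cand for sim, cand in scored[:top_n] if sim >= min_sim)
    return expanded


# Toy stand-ins for GloVe vectors.
vectors = {"idiot": [1.0, 0.1], "moron": [0.9, 0.2], "table": [0.0, 1.0]}
print(sorted(expand_lexicon({"idiot"}, vectors, top_n=1)))  # ['idiot', 'moron']
print(sorted(expand_lexicon({"idiot"}, vectors, top_n=2)))  # ['idiot', 'moron', 'table']
```

The second call shows how a looser expansion pulls in an unrelated word, which mirrors the precision drop reported above for the expanded lexicons.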
We also explore the impact on performance of the threshold θ used when mining the original lexicon. The performance under different thresholds is shown in Figure 4. As the threshold θ increases, the size of the lexicon decreases, P decreases, R increases, and F1 first increases and then decreases, reaching a maximum of 64.98 at θ = 0.5.

Conclusion
In this paper, we formalize toxic span detection as two separate problems and employ three state-of-the-art models. The strengths of each model are analyzed, and a more credible and complementary result is obtained through a voting approach. Our system ranks 1/91 on the task. In addition, we explore a lexicon-based approach. The lexicon is mined from the annotations of the training data and then expanded with WordNet and GloVe. Experiments show that the lexicon-based approach has not yet matched the performance of the ensemble approach. We believe that future work could move towards combining deep learning-based methods and lexicon-based methods.