YNU-HPCC at SemEval-2021 Task 5: Using a Transformer-based Model with Auxiliary Information for Toxic Span Detection

Toxic span detection requires identifying the spans that make a text toxic rather than simply classifying the text. In this paper, a transformer-based model with auxiliary information is proposed for SemEval-2021 Task 5. The proposed model was implemented based on the BERT-CRF architecture. It consists of three parts: a transformer-based model that obtains the token representations, an auxiliary information module that combines features from different layers, and an output layer used for classification. Various BERT-based models, such as BERT, ALBERT, RoBERTa, and XLNET, were used to learn contextual representations. The predictions of these models were ensembled using a voting strategy to improve performance on the sequence labeling task. Experimental results showed that the introduced auxiliary information can improve the performance of toxic span detection. The proposed model ranked 5th of 91 in the competition. The code of this study is available at https://github.com/Chenrj233/semeval2021_task5


Introduction
Existing toxicity detection datasets and models classify the entire comment or document and do not identify the spans that make the text toxic. A system that accurately locates the toxic spans in a text is crucial for achieving semi-automatic moderation. As a complete submission for the shared task, systems are required to extract a list of toxic spans, or an empty list, per text. We define a toxic span as a sequence of words that contributes to the text's toxicity. Table 1 shows two toxic spans, "stupid" and "a!@#!@," which have character offsets from 10 to 15 (counting starts from 0) and from 51 to 56, respectively. Systems are then expected to return the offset list for this text.
Table 1: An example post with its toxic spans.
Text: This is a stupid example, so thank you for nothing a!@#!@.
Offset List: [10, 11, 12, 13, 14, 15, 51, 52, 53, 54, 55, 56]

The main purpose of this task is to identify the toxic spans in a given text; this can be transformed into a sequence labeling task in natural language processing. Unlike normal sequence labeling tasks, this task is more challenging because a toxic span in the text may be a word, a phrase, or even an entire sentence. Traditional methods used to address sequence labeling include conditional random fields (CRF) (Lafferty et al., 2001), combined long short-term memory and CRF models (LSTM-CRF) (Gupta et al., 2019), and bidirectional encoder representations from transformers (BERT) (Devlin et al., 2019).
In this study, we use BERT, ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), and XLNET (Yang et al., 2019) to solve this problem. Compared with a conventional model, our model adds auxiliary information to improve performance on this task. A simple analysis of the text data shows that not all words in a toxic span have a toxic meaning; some meanings become toxic only in a specific context or under certain semantic conditions. Therefore, if tokens can be classified with auxiliary information such as a sentence representation, the performance of the model should improve. The experimental results show that some of the proposed methods are effective. Using ensemble learning, we merge the results of the BERT, ALBERT, RoBERTa, and XLNET models into the final prediction, obtaining the 5th rank out of 91 and an F1 score of 0.696.

The remainder of this paper is organized as follows. Section 2 describes the specific structure of the adopted model. The experimental results are summarized in Section 3. Finally, Section 4 presents the conclusions of the study.
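The voting strategy over per-model offset predictions can be sketched as follows. This is a minimal illustration; the vote threshold (strict majority here) and the example predictions are our assumptions, as the paper does not specify these details.

```python
from collections import Counter

def vote_offsets(model_predictions, min_votes=None):
    """Hard-voting ensemble over per-model character-offset lists.

    Keeps an offset if at least `min_votes` models predicted it
    (default: a strict majority of the models).
    """
    if min_votes is None:
        min_votes = len(model_predictions) // 2 + 1
    counts = Counter(off for pred in model_predictions for off in set(pred))
    return sorted(off for off, c in counts.items() if c >= min_votes)

# Hypothetical predictions from three models for one post:
preds = [
    [10, 11, 12, 13, 14, 15],       # model A
    [10, 11, 12, 13, 14, 15, 51],   # model B
    [12, 13, 14, 15, 51],           # model C
]
print(vote_offsets(preds))  # → [10, 11, 12, 13, 14, 15, 51]
```

Voting at the character-offset level, rather than the token level, sidesteps the differing subword tokenizations of BERT, RoBERTa, and XLNET.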

Transformer-based Model with Auxiliary Information

Figure 1 shows the architecture of the proposed model, which consists of three layers: a transformer-based layer, an auxiliary information layer, and an output layer. The transformer-based layer can be BERT, ALBERT, RoBERTa, XLNET, or any other transformer-based model. In the auxiliary information layer, several approaches are applied to combine the token representations. The combined token representations are used in the output layer to predict the label of each token.

Transformer-based Layer
The transformer-based layer is the first part of the model. Its purpose is to obtain the representations of the tokens and the entire text. For illustration, consider the BERT-large (Devlin et al., 2019) model, which produces token representations at each layer. With BERT-large, 25 layers of token representation vectors can be obtained: one embedding representation and twenty-four hidden states. Unlike previous methods, these 25 layers of token representation vectors are combined using several methods in the next layer. The representations produced by the transformer-based layer are then fed into the next layer.
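Obtaining all layers of token representations is straightforward with the Hugging Face Transformers library via `output_hidden_states`. The sketch below uses a tiny, randomly initialized BERT so it runs without downloading weights; for real use one would load `bert-large-cased` (or another checkpoint) with `from_pretrained`, and the toy sizes here are our illustration.

```python
import torch
from transformers import BertConfig, BertModel

# A tiny randomly initialized BERT stands in for BERT-large here;
# swap in BertModel.from_pretrained("bert-large-cased",
# output_hidden_states=True) for actual training.
config = BertConfig(hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=64,
                    output_hidden_states=True)
model = BertModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # one 8-token post
with torch.no_grad():
    outputs = model(input_ids)

# hidden_states holds the embedding output plus one tensor per layer:
# num_hidden_layers + 1 tensors (25 for BERT-large, 5 for this toy config),
# each of shape (batch, sequence_length, hidden_size).
hidden_states = outputs.hidden_states
print(len(hidden_states), hidden_states[0].shape)
```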

Auxiliary Information Layer
The traditional method directly passes the token representation vectors to the classification layer. To improve the performance of the model, we attempt to combine the token representation vectors and the sentence representation vector in different ways. Figure 2 depicts the attempted methods, which are described as follows:
• Method 1. Token vector of the last layer and the sentence vector.
• Method 2. Token vector of the last layer concatenated with the sentence vector.
• Method 3. Linear combination of the token vector of each layer.
• Method 4. Linear combination of the token vector of each layer and the sentence vector.
The combined representation of tokens passes on to the next layer.
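Method 4, the richest of the variants, can be sketched in PyTorch as a learned softmax-weighted sum over the per-layer token vectors, concatenated with a sentence vector. The use of the [CLS] token of the last layer as the sentence vector, and the softmax parameterization of the layer weights, are our assumptions; the paper does not pin down these details.

```python
import torch
import torch.nn as nn

class LayerCombination(nn.Module):
    """Sketch of Method 4: a learned linear combination of every layer's
    token vectors, concatenated with a sentence vector (here, the [CLS]
    position of the last hidden state)."""
    def __init__(self, num_layers):
        super().__init__()
        # One learnable scalar per layer; softmax keeps the weights
        # normalized (initialized uniform).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):  # tuple of (batch, seq, hidden)
        stacked = torch.stack(hidden_states, dim=0)          # (L, B, S, H)
        weights = torch.softmax(self.layer_weights, dim=0)   # (L,)
        tokens = (weights[:, None, None, None] * stacked).sum(dim=0)
        sentence = hidden_states[-1][:, 0]                   # (B, H)
        sentence = sentence.unsqueeze(1).expand_as(tokens)   # (B, S, H)
        return torch.cat([tokens, sentence], dim=-1)         # (B, S, 2H)

# Five toy layers (e.g., a 4-layer model plus embeddings), batch 2, 8 tokens:
layers = tuple(torch.randn(2, 8, 32) for _ in range(5))
combined = LayerCombination(num_layers=5)(layers)
print(combined.shape)  # → torch.Size([2, 8, 64])
```

Methods 1–3 fall out as special cases: drop the concatenation for Method 3, or keep only the last layer's token vectors for Methods 1 and 2.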

Output Layer
The output layer is a fully connected dense layer with softmax activation. It classifies whether each token belongs to a toxic span in the text. The combined representation of each token passed from the auxiliary information layer is the input of this layer, and the output layer predicts the labels of the candidate tokens. The loss function of the proposed model is the categorical cross-entropy.

Experimental Results
In this section, we present the comparative results of the proposed model.

Dataset
During the competition, we used only the data (Pavlopoulos et al., 2021) provided by the task organizers for the experiments. The task provides trial data (689 posts and spans), training data (7,939 posts and spans), and test data (2,000 posts). We used the training data as the training set and the trial data as the validation set. The goal was to find the set of character offsets of the toxic spans in each post of the test data.
As this is a sequence labeling task, a common data preprocessing method is the BIO tagging format. However, we observed better performance when the IO tagging format was adopted during training. Therefore, our output layer was a binary classification layer that outputs the probability of a token belonging to a toxic span.
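The conversion from character offsets to IO token labels can be sketched as follows. A simple whitespace tokenizer is used here for illustration; in practice the labels would be aligned to each model's subword tokenization.

```python
import re

def char_offsets_to_io(text, offsets):
    """Label each whitespace token I (inside a toxic span) or O (outside).

    A token is labeled I if any of its characters appears in the
    gold offset list.
    """
    toxic = set(offsets)
    labels = []
    for match in re.finditer(r"\S+", text):
        span = range(match.start(), match.end())
        labels.append("I" if any(i in toxic for i in span) else "O")
    return labels

# The example post from Table 1:
text = "This is a stupid example, so thank you for nothing a!@#!@."
offsets = [10, 11, 12, 13, 14, 15, 51, 52, 53, 54, 55, 56]
print(char_offsets_to_io(text, offsets))
# → ['O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'I']
```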

Evaluation Metrics
For this task, we employed the F1-score metric (da San Martino et al., 2020) to evaluate the responses of a system participating in the challenge.
For each post t_i, the predicted span was a set, S_i, of character offsets, and G_i was the set of character offsets of the ground-truth annotations of t_i. The F1 score of t_i was calculated as follows:

F1(S_i, G_i) = 2 · P(S_i, G_i) · R(S_i, G_i) / (P(S_i, G_i) + R(S_i, G_i)),

where P(S_i, G_i) and R(S_i, G_i) are the precision and recall scores, respectively, defined as follows:

P(S_i, G_i) = |S_i ∩ G_i| / |S_i|,    R(S_i, G_i) = |S_i ∩ G_i| / |G_i|.

If G_i is empty for some post t_i, we set F1(S_i, G_i) = 1 if S_i is also empty and F1(S_i, G_i) = 0 otherwise. Finally, we averaged F1(S_i, G_i) over all posts t_i.
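The per-post metric above translates directly into code. This is a minimal sketch of the evaluation, including the conventions for empty predicted or gold sets:

```python
def span_f1(pred, gold):
    """Per-post F1 over character offsets.

    Returns 1.0 when both the predicted and gold offset sets are empty,
    and 0.0 when exactly one of them is empty.
    """
    S, G = set(pred), set(gold)
    if not G:
        return 1.0 if not S else 0.0
    if not S:
        return 0.0
    overlap = len(S & G)
    if overlap == 0:
        return 0.0
    precision = overlap / len(S)
    recall = overlap / len(G)
    return 2 * precision * recall / (precision + recall)

# Partial overlap with the Table 1 gold spans (hypothetical prediction):
gold = list(range(10, 16)) + list(range(51, 57))
pred = list(range(10, 16)) + [51]
print(round(span_f1(pred, gold), 3))  # → 0.737
```

The corpus-level score is then the mean of `span_f1` over all posts.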

Implementation Details
Each model was fine-tuned for eight epochs. We experimented with the Adam (Kingma and Ba, 2015), AdamW (Loshchilov and Hutter, 2017), and stochastic gradient descent (SGD) optimizers. The final choice was AdamW with a learning rate of 5e-6.
In the training process, we attempted to use the cross-entropy loss, focal loss (Lin et al., 2020), and Dice loss (Li et al., 2020). The results on the validation set showed that the focal loss and Dice loss outperform the cross-entropy loss. This may be due to the imbalance between the toxic and non-toxic categories in the text. To compare with the baseline model, however, we finally used the cross-entropy loss to train all models.

Comparative Results
We used BERT, ALBERT, RoBERTa, and XLNET as the transformer-based layers. The model exhibiting the best performance on the validation set over the eight epochs was used to predict the spans on the test set in the competition. The results on the test set are presented in Table 2. The model that performed the best on the test set over the eight epochs is also presented in Table 2.
In terms of performance on the validation set, the BERT and XLNET models with the auxiliary information layer are better than those without it. Method 4, mentioned earlier, achieves the highest F1 score. In the case of ALBERT, only method 1 improves performance, whereas methods 3 and 4 improve the performance of RoBERTa.
On the test set, regardless of validation-set performance, the F1 score increases by 0.004 when using method 3 in the BERT model and by 0.002 when using method 4 in XLNET. The auxiliary information layer does not improve the performance of ALBERT or RoBERTa.
The results show that the best-performing model on the validation set differs significantly from the best-performing model on the test set. The reason for this difference may be an inconsistency between the data distributions of the validation and test sets.
However, the results indicate that even when the validation set is not fully representative, the auxiliary information layer can still effectively improve the performance of the baseline model on the validation set. The BERT and XLNET models benefit most from the auxiliary information layer.

Conclusion
In this paper, we introduced the method we used in SemEval-2021 Task 5. We improved the performance of the basic model by reducing the number of categories for each token, selecting an appropriate loss function, and adding auxiliary information to the token representation vectors during classification, finally obtaining a model that can detect the toxic spans in a text. Our experimental results showed that adding auxiliary information to the original token representation vectors is helpful in sequence labeling tasks.
In addition, we found that the model has some limitations. After analyzing the prediction results, we observed that although the model can learn the representation of each token well, token classification errors can occur when some tokens are toxic without the entire text being toxic. One possible solution for this is to add a text classification task to train the model.