MIPT-NSU-UTMN at SemEval-2021 Task 5: Ensembling Learning with Pre-trained Language Models for Toxic Spans Detection

This paper describes our system for SemEval-2021 Task 5 on Toxic Spans Detection. We developed ensemble models using BERT-based neural architectures and post-processing to combine tokens into spans. We evaluated several pre-trained language models using various ensemble techniques for toxic span identification and achieved sizable improvements over our baseline fine-tuned BERT models. Finally, our system obtained an F1-score of 67.55% on the test data.


Introduction
Toxic speech has become a rising issue for social media communities. Abusive content is highly diverse, so detecting offensive language and toxic speech is not a trivial task. Besides, moderating lengthy comments and posts on social media is often time-consuming. In this regard, the task of detecting toxic spans in social media texts deserves close attention.
This work is based on the participation of our team, named MIPT-NSU-UTMN, in SemEval-2021 Task 5, "Toxic Spans Detection" (Pavlopoulos et al., 2021). The organizers of the shared task provided participants with trial, train, and test sets of English social media comments annotated at the span level, indicating the presence or absence of text toxicity. We formulated the task as a token classification problem and investigated several BERT-based models using two-step knowledge transfer. We found that preliminary fine-tuning of the model on data close to the target domain improves the quality of token classification. The source code of our models is available at https://github.com/morozowdmitry/semeval21. The paper is organized as follows. A brief review of related work is given in Section 2. The definition of the task is summarized in Section 3. The proposed methods and experimental settings are elaborated in Section 4. Section 5 contains the results and error analysis. Section 6 concludes the paper.

Related Work
Computational approaches to tackling text toxicity have recently gained a lot of interest due to the widespread use of social media. Since moderation is crucial to promoting healthy online discussions, research on toxicity detection has been attracting much attention. Our work is also related to hate speech and abusive language detection (Fortuna et al., 2020). The toxic speech detection task is usually framed as a supervised learning problem, and fairly generic features, such as bag of words (Harris, 1954) or word embeddings (Mikolov et al., 2013), systematically yield reasonable classification performance (Fortuna and Nunes, 2018; Schmidt and Wiegand, 2017). To better understand the mechanisms of toxic speech detection, some scholars (Waseem et al., 2017; Karan and Šnajder, 2018; Swamy et al., 2019) compared different techniques for abusive language analysis. Neural architectures and deep learning methods have achieved strong results in this domain. Pavlopoulos et al. (2017a,b) explored the possibilities of deep learning and deep attention mechanisms for abusive comment moderation. Park and Fung (2017) proposed an approach to classifying abusive language based on convolutional neural networks (CNNs). Chakrabarty et al. (2019) used a Bidirectional Long Short-Term Memory (BiLSTM) network, and Castelle (2018) experimented with CNNs and Gated Recurrent Units. Some recent studies (Mozafari et al., 2019; Risch et al., 2019; Liu et al., 2019a; Nikolov and Radivchev, 2019) utilized pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) to detect offensive or abusive language.
In recent years, the task of detecting and analyzing abusive, toxic, or offensive language has attracted the attention of more and more researchers. The shared tasks based on carefully curated resources, such as those organized at the SemEval (Zampieri et al., 2019; Basile et al., 2019), GermEval (Wiegand et al., 2018), EVALITA (Bosco et al., 2018), and OSACT (Mubarak et al., 2020) events, have significantly contributed to the progress of the field and to the enrichment of linguistic resources. In addition to the corpora collected for these shared tasks, Rosenthal et al. (2020) released a large-scale dataset for offensive language identification, and Ibrohim and Budi (2018) and Komalova et al. (2021) presented various datasets for abusive speech detection in non-English languages. Most of these datasets classify whole texts or documents, and do not identify the spans that make a text toxic.

Shared Task
The task focuses on evaluating systems that detect the spans that make a text toxic, whenever detecting such spans is possible. The goal of the task is to identify the sequences of words (as character offsets) that contribute to the toxicity of the text, for example:
• Input: "This is a stupid example, so thank you for nothing a!@#!@"
• Output: the character offsets of the toxic spans, such as those of the word "stupid".
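To make the offset convention concrete, the character offsets of the word "stupid" in the example above can be computed directly:

```python
text = "This is a stupid example, so thank you for nothing a!@#!@"

# Character offsets are zero-based positions in the raw string.
start = text.index("stupid")
offsets = list(range(start, start + len("stupid")))
print(offsets)  # [10, 11, 12, 13, 14, 15]
```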
The sources of data were various posts (comments) from publicly available datasets. The provided dataset contains 10,629 posts split into training (7939), trial (690), and test (2000) subsets.
Inspired by Da San Martino et al. (2019), the organizers proposed to employ the F1-score for evaluating the responses of a system participating in the shared task. Let system A_i return a set S_{A_i}^t of character offsets for the parts of post t found to be toxic, and let G^t be the set of character offsets of the ground truth annotations of t. The F1-score of system A_i with respect to the ground truth G for post t is then

F_1^t(A_i, G) = 2 · P^t(A_i, G) · R^t(A_i, G) / (P^t(A_i, G) + R^t(A_i, G)),

where P^t(A_i, G) = |S_{A_i}^t ∩ G^t| / |S_{A_i}^t| is the precision, R^t(A_i, G) = |S_{A_i}^t ∩ G^t| / |G^t| is the recall, and |·| denotes set cardinality. If G^t is empty, F_1^t(A_i, G) is defined to be 1 if S_{A_i}^t is also empty and 0 otherwise.

The final F1-score is the average of F_1^t(A_i, G) over all the posts t of an evaluation dataset T, yielding a single score for system A_i.
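As an illustration, the metric can be implemented in a few lines of Python (a minimal sketch; we assume the convention that an empty gold annotation scores 1.0 against an empty prediction and 0.0 otherwise):

```python
def span_f1(pred_offsets, gold_offsets):
    """Per-post F1 over sets of toxic character offsets."""
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not gold:
        # No gold toxic spans: reward an empty prediction only.
        return 1.0 if not pred else 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def system_f1(all_preds, all_golds):
    """Average the per-post F1 over an evaluation dataset."""
    scores = [span_f1(p, g) for p, g in zip(all_preds, all_golds)]
    return sum(scores) / len(scores)
```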

Methodology
The stated problem was transformed from character-level to token-level binary classification. The proposed solution uses a pre-trained language model with a classification head to classify tokens. Different configurations of BERT pre-trained as masked language models were considered as the backbone.
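The post-processing that maps per-token predictions back to character offsets can be sketched as follows (function names are illustrative; `offset_mapping` is the per-token (start, end) character span, such as the one returned by a fast HuggingFace tokenizer with `return_offsets_mapping=True`, where special tokens get (0, 0)):

```python
def tokens_to_char_offsets(labels, offset_mapping):
    """Convert per-token toxic (1) / non-toxic (0) labels into a
    sorted list of toxic character offsets.

    labels         : list of 0/1 predictions, one per token
    offset_mapping : list of (start, end) character positions per
                     token; special tokens are (0, 0) and are skipped
    """
    chars = set()
    for label, (start, end) in zip(labels, offset_mapping):
        if label == 1 and end > start:  # skip special tokens
            chars.update(range(start, end))
    return sorted(chars)
```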
Due to the lack of publicly available token-level labeled datasets for toxic comments and the relatively small size and sparsity of the dataset provided by the competition, the following training pipeline was proposed to enhance knowledge transfer. First, fine-tune pre-trained BERT on the larger-scale task of toxic comment classification, using the Jigsaw dataset 1 from which the competition data were constructed. Second, fine-tune the obtained model on the actual toxic token classification problem. The exact training parameters are given below.
For the first step:
• remove texts that occur in the spans dataset from the classification dataset to prevent data leakage (since the spans dataset is sampled from the classification dataset);
• 4 epochs, 200 tokens max length, batch size 64, gradient accumulation over 10 steps, mixed-precision FP16;
• default AdamW (Loshchilov and Hutter, 2017) with lr = 4e-5, Layer-wise Decreasing Layer Rate (Sun et al., 2019) with decay η = 0.95, and a cosine learning rate (LR) schedule with T = 4 epochs and constant LR after epoch 3;
• bert-base-uncased selected as the best performance/speed ratio;
• the best model selected on validation every 0.1 epoch by AUC.
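The learning rate settings above can be sketched as follows (a minimal illustration under the stated hyperparameters; the grouping of BERT parameters into layers in our actual code may differ):

```python
import math


def layerwise_lrs(base_lr=4e-5, n_layers=12, decay=0.95):
    """Layer-wise Decreasing Layer Rate: the top encoder layer keeps
    base_lr, and each layer below it is scaled down by `decay`."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]


def cosine_lr(step, total_steps, base_lr=4e-5, constant_after=0.75):
    """Cosine LR decay over `total_steps`, held constant after the
    given fraction of training (epoch 3 of 4 in our setup)."""
    frac = min(step / total_steps, constant_after)
    return 0.5 * base_lr * (1 + math.cos(math.pi * frac))
```

In practice these per-layer rates would be passed to AdamW as separate parameter groups, one per encoder layer.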
For the second step:
• hold out ≈ 14% of the data to train an ensemble of models later;
• out-of-5-fold training on the remaining ≈ 86% of the data;
• 4 epochs, 512 tokens max length, batch size 16, gradient accumulation over 10 steps, mixed-precision FP16;
• default AdamW with lr = 4e-5, Layer-wise Decreasing Layer Rate with decay η = 0.95, and a cosine LR schedule with T = 4 epochs;
• the best model selected on validation every 0.1 epoch by F1-score.
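The hold-out plus out-of-fold split above can be sketched as follows (an illustrative index-based split; the actual fold boundaries, hold-out fraction rounding, and random seed are assumptions):

```python
import random


def make_splits(n_posts, holdout_frac=0.14, n_folds=5, seed=0):
    """Shuffle post indices, hold out a fraction for ensemble
    validation, and split the rest into n_folds disjoint folds."""
    idx = list(range(n_posts))
    random.Random(seed).shuffle(idx)
    n_hold = round(n_posts * holdout_frac)
    holdout, rest = idx[:n_hold], idx[n_hold:]
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return holdout, folds
```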
The final solution contains N × K models, where N is the number of different backbone BERT architectures and K is the number of folds (5 in the current experiments). The obtained models are then ensembled using different strategies, validated on the single hold-out dataset:
• hard voting: final spans are those selected by at least one model (span union), by all models (span intersection), or by an intermediate rule requiring at least m models;
• soft voting: the final probability is calculated as a weighted sum of the models' probabilities;
• training a meta-classifier.
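The hard and soft voting strategies can be sketched over per-character outputs as follows (a simplified illustration; the meta-classifier variant is omitted, and the threshold and weights are assumptions):

```python
from collections import Counter


def hard_vote(span_sets, m=1):
    """Keep a character offset if at least m models predicted it.
    m=1 gives the span union, m=len(span_sets) the intersection."""
    counts = Counter(i for s in span_sets for i in set(s))
    return sorted(i for i, c in counts.items() if c >= m)


def soft_vote(prob_lists, weights=None, threshold=0.5):
    """Weighted average of per-offset toxicity probabilities, then
    thresholded; prob_lists[k][i] is model k's probability that
    character offset i is toxic."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n = len(prob_lists[0])
    avg = [sum(w * p[i] for w, p in zip(weights, prob_lists)) for i in range(n)]
    return [i for i, p in enumerate(avg) if p >= threshold]
```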
Since models other than bert-base-uncased did not show comparable performance on token classification (and, in later tests on fold 0, did not show a good F1-score on the actual task either), subsequent experiments were continued only with the bert-base-uncased pre-trained model fine-tuned for token classification.
The results for the second step are as follows:
• training data: train + trial, 8621 comments;
• average F1-score over 5 folds: 0.6714.
The experiments were conducted with the HuggingFace Transformers library.
Results and Error Analysis

Many patterns in our results are expected, but some stand out. In general, our model is good at detecting obscene language and utterances that demean honor and dignity or denote low moral character. We noticed that our model is not very good at identifying posts that have no toxic span annotations. According to the corpus description, in some toxic posts the core message being conveyed may be inherently toxic. For example, a sarcastic post can indirectly claim that people of a particular origin are inferior. Hence, it was difficult to attribute the toxicity of such posts to particular spans, and the corresponding posts were labeled as not containing toxic spans. Among our results, there are many examples where the model detected spans in unannotated posts, for example:
• "uhhh Hillary Clinton is a serial killer and thief": [] (true annotation); [26,27,28,29,30,31,33,34,35,36,37,38,44,45,46,47,48] (our annotation, corresponding to "serial killer" and "thief");
• "This goes way beyond just being an asshole skipper, dude must have some serious mental issues": [] (true annotation); [35,36,37,38,39,40,41] (our annotation, corresponding to "asshole").
In addition, some texts in the dataset raise questions about the credibility of the annotation, for example:
• "How the hell is this news? Am I supposed to be shocked that the Crown Prince of Bahrain or one of the world's biggest celebrity superstars get's better access to the State Department then I do? During which administration has this ever not been true? The media's desperation to keep this election close is far past ridiculous" (training set, the toxic span annotation is underlined);
• "Yup. NVN used the Press. The Press was USED. Used like their sister on prom night! Idiots. All faux-erudite, not realizing they were being played" (training set, the original annotation is underlined);
• "And you are a complete moron who obviously doesn't know the meaning of the word narcissist. By the way your bias is showing" (test set, the original annotation is underlined, the annotation of our model is highlighted in bold).
The final result of our model is presented in Table 1. As can be seen from the table, the participants' systems produce close results. Our system achieved an F1-score of 67.55% on the test set of this shared task, which attracted 91 submitting teams in total. This value exceeded the average result by almost 10%.

Conclusion
This paper introduces our BERT-based model for toxic spans detection. As expected, pre-training the BERT model on an additional domain-specific dataset improves subsequent toxic spans detection performance. Experimenting with different fine-tuning approaches has shown that our BERT-based model benefits from the two-step knowledge transfer technique. An ensemble with span intersection obtained our best result on the test data.
In our future work, we will evaluate various language models, such as distilled versions of BERT (Jiao et al., 2020) and RoBERTa (Liu et al., 2019b).