UPB at SemEval-2021 Task 5: Virtual Adversarial Training for Toxic Spans Detection

The real-world impact of polarization and toxicity in the online sphere marked the end of 2020 and the beginning of this year in a negative way. Semeval-2021, Task 5 - Toxic Spans Detection is based on a novel annotation of a subset of the Jigsaw Unintended Bias dataset and is the first language toxicity detection task dedicated to identifying the toxicity-level spans. For this task, participants had to automatically detect character spans in short comments that render the message as toxic. Our model considers applying Virtual Adversarial Training in a semi-supervised setting during the fine-tuning process of several Transformer-based models (i.e., BERT and RoBERTa), in combination with Conditional Random Fields. Our approach leads to performance improvements and more robust models, enabling us to achieve an F1-score of 65.73% in the official submission and an F1-score of 66.13% after further tuning during post-evaluation.


Introduction
Nowadays, online engagement in social activities is at its highest levels. The lockdowns during the 2020 COVID-19 pandemic increased the overall time spent online. In Germany for instance, Lemenager et al. (2021) observed that 71% of considered subjects increased their online media consumption during this period. Unfortunately, online toxicity is present in a large part of the social and news media platforms. As such, automated early detection is necessary since toxic behavior is often contagious and leads to a spillover effect (Kwon and Gruzd, 2017).
Recently, a significant effort was put into the detection of toxic and offensive language (van Aken et al., 2018; Paraschiv and Cercel, 2019; Tanase et al., 2020a,b), but the challenging nature of these problems leaves several avenues unexplored. In addition, most shared tasks focus on the distinction between toxic/non-toxic (Wulczyn et al., 2017; van Aken et al., 2018; Juuti et al., 2020) or offensive/non-offensive posts in various languages (Struß et al., 2019; Zampieri et al., 2019a,b, 2020; Mandl et al., 2020; Aragón et al., 2020). The Semeval-2021 Task 5, namely Toxic Spans Detection (Pavlopoulos et al., 2021), tackles the problem of identifying the exact portions of a document that make it toxic. The provided dataset is a subset of the Jigsaw Unintended Bias in Toxicity Classification dataset, with annotated spans that mark the toxic content of each document.
In this paper, we describe our participation in the aforementioned Toxic Spans Detection task using several Transformer-based models (Vaswani et al., 2017), including BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), with a Conditional Random Field (CRF) (Lafferty et al., 2001) layer on top to identify spans that include toxic language. We introduce Virtual Adversarial Training (VAT) (Miyato et al., 2015) in our training pipeline to increase the robustness of our models. Furthermore, we enhance part of our models with character embeddings based on the Jigsaw Unintended Bias dataset to improve their performance. Finally, we compare the proposed models and analyze the impact of various hyperparameters on their performance.
The rest of the paper is structured as follows. The next section introduces a review of methods related to toxic language detection, sequence labeling, and adversarial training (Kurakin et al., 2016). The third section discusses the employed models, as well as the VAT procedure. Results are presented in the fourth section, followed by discussions, conclusions, and an outline of possible future works.

Related Work
Toxic Language Detection. There are several research efforts to detect toxic texts based on the Jigsaw Unintended Bias dataset, out of which most focus on the Kaggle competition task, namely predicting the toxicity score for a document. Morzhov (2020) compared models based on Convolutional Neural Networks (CNNs) (Kim, 2014) and Recurrent Neural Networks (Cho et al., 2014) with a Bidirectional Encoder Representations from Transformers (BERT) architecture (Devlin et al., 2019), obtaining the best performance from an ensemble of all used models. Gencoglu (2020) and Richard and Marc-André (2020) used the same dataset to improve on the automatic detection of cyberbullying content.
Sequence Labeling. Predicting the type for each token from a document rather than providing a label for the whole sequence is a task often associated with named entity recognition (Ma and Hovy, 2016), but can be performed in other Natural Language Processing pipelines, including part-of-speech tagging (Ling et al., 2015) and chunking (Hashimoto et al., 2017). A common practice in sequence tagging models (Peters et al., 2018; Avram et al., 2020; Ionescu et al., 2020) is to use a CRF as a final decoding layer.
Adversarial Training. Researched first in image classification (Szegedy et al., 2013), adversarial examples are small input perturbations that are hardly distinguishable for humans, but can dramatically shift the output of a neural network. These examples can be used in adversarial training (AT) (Goodfellow et al., 2014) as a regularization method that can increase the robustness of the model. Using the worst-case outcome from a distribution of small norm perturbations around an existing training sample, a new data point is created and inserted into the training process.
Extending AT to a semi-supervised setting, VAT (Miyato et al., 2016) does not require label information for the adversarial examples. VAT aims to increase the local distributional smoothness by adding perturbations to the embedding output. Recently, several studies (Kumar and Singh, 2020;Si et al., 2020) focused on applying VAT in Transformer-based models and obtained improvements in comparison to baseline methods on several classification tasks.

Corpus
The dataset for the competition is a subset of the Jigsaw Unintended Bias in Toxicity Classification English language corpus, with annotated spans that make the utterance toxic. From the 8,597 trial and train records, 8,101 had at least one toxic span. By cross-referencing with the original Jigsaw dataset, which contains additional information, we retrieved the toxicity scores for each text and determined that the mean toxicity scores for the train and test sets were very close (0.8429 versus 0.8440; see Figure 1 for corresponding kernel density estimates). Moreover, only 17 out of 2,000 test data rows had a toxicity score below 0.75. Nevertheless, an imbalance was noticed between the test and train sets: 80.3% of the entries from the test set had at least one toxic span versus a considerably higher density of 94.2% in the train set. The training dataset was split into sentences while ensuring that there are no splits inside a toxic span and there are no sentences shorter than three words. Under these settings, our training dataset consists of a total of 26,589 sentences, including 10,117 records that contained toxic spans; 15% were selected for validation. Another 2,000 entries were provided by the competition organizers for testing; the labels for this dataset were made available after the competition.
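The sentence-splitting constraints described above can be illustrated with a minimal sketch. It assumes toxic spans are given as character offsets, uses a simple punctuation-based boundary rule, and merges too-short pieces into a neighboring sentence; the function name `split_sentences`, the boundary regular expression, and the merging strategy are all illustrative assumptions rather than the procedure actually used.

```python
import re

def split_sentences(text, toxic_offsets, min_words=3):
    """Split text into (start_offset, sentence) pairs.

    Assumptions (not from the paper): boundaries are whitespace runs that
    follow '.', '!' or '?'; a boundary is skipped if any of its characters
    is annotated as toxic; short pieces are merged with their neighbors.
    """
    toxic = set(toxic_offsets)
    # Candidate cut points: the index where the next sentence would start.
    cuts = [
        m.end()
        for m in re.finditer(r"(?<=[.!?])\s+", text)
        if not any(i in toxic for i in range(m.start(), m.end()))
    ]
    sentences, start = [], 0
    for cut in cuts + [len(text)]:
        if cut <= start:
            continue
        # Only close the sentence once it has enough words;
        # otherwise keep growing it up to the next cut point.
        if len(text[start:cut].split()) >= min_words:
            sentences.append((start, text[start:cut]))
            start = cut
    if start < len(text):
        # A trailing short piece is merged into the previous sentence.
        if sentences:
            prev_start, _ = sentences.pop()
            sentences.append((prev_start, text[prev_start:]))
        else:
            sentences.append((0, text))
    return sentences
```

Returning the start offset of each sentence makes it possible to shift the character-level span annotations into sentence-local coordinates.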
For our unsupervised training samples, we selected 20,000 random records from the Jigsaw dataset, making sure there was no overlap with the Semeval-2021 training data. Additionally, we replaced all URLs with a special token and lowercased all records.
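A minimal sketch of this normalization step is given below. The URL pattern and the placeholder string `[url]` are illustrative assumptions; the actual special token is not specified in the text.

```python
import re

# Hypothetical URL pattern and placeholder token (assumptions for illustration).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
URL_TOKEN = "[url]"

def preprocess(text: str) -> str:
    """Replace URLs with a special token, then lowercase the record."""
    return URL_PATTERN.sub(URL_TOKEN, text).lower()
```

Note that replacing a URL changes the length of the text, so any character-level span annotations located after a URL would need to be shifted accordingly.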

Virtual Adversarial Training
In Adversarial Training, the robustness of the model is improved through examples that are close to the available training data, but to which the model is likely to assign a label different from the training one, thus leading to an increase in loss. In VAT, Miyato et al. (2018) adapted adversarial training from supervised to semi-supervised settings by adding a loss term based on the Kullback-Leibler divergence between the predictions for the original data and the predictions for the same data with random perturbations.
Since the output distributions are compared, the information about labels is not needed for the adversarial loss, which is the Kullback-Leibler divergence between the distribution of the predicted output ŷ given the embedding e associated with the sample and the corresponding distribution given the perturbed embedding e + d, where d is the perturbation. True labels are required in general to compare the losses and find the worst-case perturbations. However, this can be avoided by bounding the norm of the perturbation δ to η; the perturbation d then becomes the δ within this bound that maximizes the divergence. Afterwards, we can estimate the perturbation d using the gradient g of the divergence and a hyperparameter ε for the magnitude by applying the second-order Taylor approximation and a single iteration of the power method. In order to reduce the complexity and computation for the gradient, we ignore the dependency on the model parameters Θ. Also, the number of power iterations can be another hyperparameter for the model. The final loss function used by all models is a combination of the supervised and unsupervised adversarial losses, weighted by γ, another tunable hyperparameter.
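The quantities described above can be written out as follows. This is a sketch based on the standard VAT formulation of Miyato et al., not necessarily the exact notation used here. Three points are assumptions: the symbol ε for the magnitude hyperparameter, the evaluation of the gradient at a small random perturbation, and the convex combination in the final loss (the text only states that γ weights the supervised and adversarial terms).

```latex
% Adversarial loss: divergence between predictions on clean and perturbed embeddings.
% \hat{\Theta} denotes the current parameters treated as constants.
\mathcal{L}_{adv}(e; \Theta) =
  D_{KL}\left[ p(\hat{y} \mid e; \hat{\Theta}) \,\|\, p(\hat{y} \mid e + d; \Theta) \right]

% Worst-case perturbation within the norm bound \eta.
d = \arg\max_{\delta,\ \|\delta\|_2 \le \eta}
  D_{KL}\left[ p(\hat{y} \mid e; \hat{\Theta}) \,\|\, p(\hat{y} \mid e + \delta; \hat{\Theta}) \right]

% Power-method estimate: g is the gradient of the divergence with respect to
% \delta, evaluated at a small random perturbation (assumption).
g = \nabla_{\delta}\,
  D_{KL}\left[ p(\hat{y} \mid e; \hat{\Theta}) \,\|\, p(\hat{y} \mid e + \delta; \hat{\Theta}) \right],
\qquad
d \approx \epsilon \, \frac{g}{\|g\|_2}

% Final loss (assumed convex combination of the two terms).
\mathcal{L} = \gamma \, \mathcal{L}_{sup} + (1 - \gamma) \, \mathcal{L}_{adv}
```

Because only predicted distributions appear in the adversarial term, it can be computed on unlabeled records, which is what allows the additional Jigsaw samples to be used during fine-tuning.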

Implementation Details
In our experiments, pre-trained Transformer models are followed by a linear transformation of their last hidden state, and a final CRF layer. More precisely, we compare the effectiveness of several flavors of BERT models, alongside the VAT technique, as follows: BERT base, a 768-dimensional model provided by Google (BERT-base-CRF-VAT), Unitary's toxic BERT (Hanu and Unitary team, 2020) (BERT-toxic-CRF-VAT), BERT pre-trained on fake and hyperpartisan news (Paraschiv et al., 2020) (BERT-news-CRF-VAT), and RoBERTa-large-CRF-VAT, the equivalent of BERT-base-CRF-VAT that relies on RoBERTa instead of BERT. In addition to these models, we experimented with enhancing the BERT-based representation with character embeddings (Kim et al., 2016). These character representations were trained on the entire Jigsaw dataset using a CNN-BiLSTM model (Ma and Hovy, 2016) with the next character prediction objective. We concatenated the obtained character-level embeddings with the aforementioned Transformer's last hidden state, and refer to this variant as BERT-news-CRF-VAT+chars.
As baseline systems, we design two methods: LSTM-CRF-VAT with GloVe embeddings (Pennington et al., 2014) and LSTM-CRF-VAT+chars, which adds character-level embeddings. In all BERT-based models, we used a maximum sequence length of 96 tokens, whereas the LSTM baselines used a maximum sequence length of 64 tokens. Since the input words can consist of more than one token, we assign the toxicity label to a word if at least one component token is inferred as toxic.
The best hyperparameters for the BERT-base model were determined through grid search on the development set. The identified optimal values (ε = 2, η = 0.1, and two power iterations) were used in all other flavors; γ was set to 0.5 in the final loss function to balance both approaches. Furthermore, all BERT-based models were trained for one epoch, in contrast with the LSTM-CRF-VAT and LSTM-CRF-VAT+chars baselines, which were trained for three epochs and four epochs, respectively.
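The word-level aggregation rule can be sketched as below, assuming that each subword token carries the index of the word it belongs to (with `None` for special tokens), as many subword tokenizers provide. The function name `word_labels` and the 0/1 label encoding are illustrative assumptions.

```python
def word_labels(token_labels, word_ids):
    """Mark a word as toxic (1) if at least one of its tokens is toxic.

    token_labels: predicted 0/1 label per subword token.
    word_ids: index of the source word per token, or None for special tokens.
    """
    n_words = max((w for w in word_ids if w is not None), default=-1) + 1
    labels = [0] * n_words
    for label, word in zip(token_labels, word_ids):
        if word is not None and label == 1:
            labels[word] = 1
    return labels
```

For instance, if a word is split into two tokens and only the second one is predicted as toxic, the whole word is labeled toxic.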
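To make the role of these hyperparameters concrete, the sketch below estimates a virtual adversarial perturbation with the power method on a toy linear-softmax classifier. It is not the Transformer-CRF pipeline: the toy model, the finite-difference gradient (used here instead of backpropagation), and all function names are assumptions for illustration. The sketch also reads η as the norm of the probing perturbation and ε as the magnitude of the final perturbation, which is one possible interpretation of the description above.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def predict(weights, e):
    # Toy classifier (assumption): p(y | e) = softmax(W e).
    return softmax([sum(w * x for w, x in zip(row, e)) for row in weights])

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def kl_gradient(weights, e, delta, h=1e-5):
    # Central finite differences of KL(p(.|e) || p(.|e + delta)) w.r.t. delta.
    p_clean = predict(weights, e)
    grad = []
    for i in range(len(delta)):
        plus, minus = list(delta), list(delta)
        plus[i] += h
        minus[i] -= h
        f_plus = kl(p_clean, predict(weights, add(e, plus)))
        f_minus = kl(p_clean, predict(weights, add(e, minus)))
        grad.append((f_plus - f_minus) / (2 * h))
    return grad

def vat_perturbation(weights, e, epsilon=2.0, eta=0.1, power_iterations=2, seed=0):
    # Start from a random unit direction and refine it with power iterations.
    rng = random.Random(seed)
    d = normalize([rng.gauss(0.0, 1.0) for _ in e])
    for _ in range(power_iterations):
        g = kl_gradient(weights, e, [eta * x for x in d])
        d = normalize(g)
    return [epsilon * x for x in d]

def vat_loss(weights, e, **kwargs):
    # Adversarial loss: divergence between clean and perturbed predictions.
    d = vat_perturbation(weights, e, **kwargs)
    return kl(predict(weights, e), predict(weights, add(e, d)))
```

No label appears anywhere in `vat_loss`, which is why the same computation can be applied to unlabeled records.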

Results
The evaluation metric for the Toxic Spans Detection task was an adapted version of the F1-score (Da San Martino et al., 2019) that takes into account the size of the overlap between prediction spans and golden labels.
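A minimal sketch of such an overlap-based score is shown below, assuming that predictions and gold annotations are both given as sets of toxic character offsets per document and that the per-document scores are averaged. The handling of empty sets (a score of 1 when both are empty, 0 when only one is) reflects our understanding of the task definition and should be treated as an assumption, as should the function names.

```python
def span_f1(pred_offsets, gold_offsets):
    """F1 between predicted and gold toxic character offsets of one document."""
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_span_f1(all_preds, all_golds):
    """Average the per-document scores over the evaluation set."""
    scores = [span_f1(p, g) for p, g in zip(all_preds, all_golds)]
    return sum(scores) / len(scores)
```

Under this formulation, predicting a longer span than the annotated one keeps recall high but lowers precision, while predicting any span for a document without annotations yields a score of 0 for that document.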
Results for all developed models with the aforementioned hyperparameters (i.e., γ = 0.5, ε = 2, η = 0.1, and two power iterations) are presented in Table 1. Since the training data had a slightly different distribution of the span density, some of our models that performed worse on our dev set performed better on the competition test set. Adding the character embedding representation to BERT-based models did not prove to be of use in our pre-evaluation tests, but in post-evaluation, we noticed that slightly tweaking the γ hyperparameter for the loss from 0.5 to 0.6 brought the F1-score to 66.13%. Although performance on the validation set was insensitive to the change in γ between 0.5 and 0.6, the results on the test set were more than 1.5% apart. We attribute this mostly to the unsupervised training, which strengthens the model's confidence on edge cases and thereby lowers its precision. Figure 2 shows the influence of the perturbation magnitude ε on the overall performance of three models. The impact of ε on the effectiveness of adversarial training is significant, but it is also highly dependent on the model used and can only be determined experimentally.
Our models performed well on the detection task, learning not only common toxic expressions like "moron", "stupid", "pathetic troll", "disgusting", "hang-em high", but also obfuscated expressions like "f*cking nasty" and "b*tchy". Nonetheless, the models fail to detect more heavily obscured words, as in "you don't know s***" or "Kill this F'n W*ore on site". All models have the tendency to over-predict toxicity by adding words to the toxic expression; for example, "What a pile of shit" was automatically labeled as "What a pile of shit".
The character-level embeddings boosted the performance of the baseline LSTM-CRF-VAT model but did not improve any BERT model, since they lead to detecting longer spans as toxic (see Table 2), which in turn lowers precision.

Discussions and Error Analysis
In this section, we analyze the BERT-toxic-CRF-VAT performance versus the golden label values from the competition test set. The precision and recall for our best model are 65.74% and 85.54%, respectively, which are indicative of a tendency to over-predict toxic spans. As we mentioned in section 3.1, even though almost all documents in the test set had a high toxicity score in the original Jigsaw dataset, many had no annotated toxic spans. Indeed, there were 295 records where our model detected a toxic span and none were labeled in the test set. Table 3 includes examples of such detected spans whose labeling is subjective, as well as other detection errors. Words like "stupid", "dumb", and "crap" were assigned as toxic throughout the test data due to their high presence in the training data spans. There were also milder errors, namely spans that overlap with the golden labels, but where the model omits part of the sequence of words. Samples like "You are a nut ball" detected only as "You are a nut ball" or "So like every other racist, he's a hypocrite" marked by the model as "So like every other racist, he's a hypocrite" can be perceived as likely errors even for human annotators.

Annotated sample | BERT-news-CRF-VAT+chars | BERT-toxic-CRF-VAT
Greedy pig strikes again! | Greedy pig strikes again! | Greedy pig strikes again!
What's honorable is telling idiots that they're idiots (even if tardily) | What's honorable is telling idiots that they're idiots (even if tardily) | What's honorable is telling idiots that they're idiots (even if tardily)
A fool and our money are soon to be parted. | A fool and our money are soon to be parted. | A fool and our money are soon to be parted.
Table 2: Examples of model predictions with character embeddings (BERT-news-CRF-VAT+chars) and without (BERT-toxic-CRF-VAT), compared to the official annotations. The toxic spans are highlighted.

Annotated sample | Model predictions
As is Drumpf a lying gasbag. | As is Drumpf a lying gasbag.
The Liberals should listen to the economic experts and dramatically increase immigration levels (at least 500,000 per year). Stop listening to the loser racist trash. | The Liberals should listen to the economic experts and dramatically increase immigration levels (at least 500,000 per year). Stop listening to the loser racist trash.
AK Jen -Russia is no longer a communist nation. Stop spouting that -it makes you look stupid. | AK Jen -Russia is no longer a communist nation. Stop spouting that -it makes you look stupid.
The "human rights" squad need a swift kick in the derriere. | The "human rights" squad need a swift kick in the derriere.
Rabidly anti-Canadian troll. | Rabidly anti-Canadian troll.
Only in that sick and twisted brain stem of yours. |
Table 3: Examples from the competition test dataset of differences between the annotations and the predictions from BERT-toxic-CRF-VAT model. The toxic spans are highlighted.

Conclusions and Future Work
In this paper, several Transformer-based models (i.e., BERT and RoBERTa) were tested together with Virtual Adversarial Training to increase their robustness for identifying toxic spans from textual information. Our experiments indicate that applying VAT increases performance and that domain-specific models have higher performance when compared to larger general models.
In terms of future work, we plan to experiment with self-supervised adversarial training to improve the robustness of our models. As we noticed in this dataset too, online users find clever ways to hide offensive and toxic expressions. Adversarial training can be effectively employed to detect these attempts, and a study of its impact on offensive and hate speech classifiers is worth pursuing as a follow-up.