YoungSheldon at SemEval-2021 Task 5: Fine-tuning Pre-trained Language Models for Toxic Spans Detection using Token classification Objective

In this paper, we describe our system for SemEval-2021 Task 5: Toxic Spans Detection. Our system approaches the problem as a token classification task. We trained our model to find toxic words and concatenated their spans to predict the toxic spans within a sentence. We fine-tuned Pre-trained Language Models (PLMs) to identify the toxic words. For fine-tuning, we stacked a classification layer on top of the PLM features of each word to classify whether it is toxic. PLMs are pre-trained using different objectives, and their performance may differ on downstream tasks. We therefore compare the performance of BERT, ELECTRA, RoBERTa, XLM-RoBERTa, T5, XLNet, and MPNet for identifying toxic spans within a sentence. Our best performing system used RoBERTa. It achieved an F1 score of 0.6841 and ranked 16th on the official leaderboard.


Introduction
The internet and social networking sites have brought people together by providing a simple yet effective means of communication. Over the years, people have used them to exchange positive ideas, but recently there has been a rise in toxic content and hate speech on the internet (Zampieri et al., 2019, 2020). Most datasets (Fortuna et al., 2020) dealing with toxic, offensive, or hateful content aim to classify an entire text as belonging to a particular class. They do not identify the parts of the text that make it toxic. Manual filtering of toxic data is difficult and can cause mental and emotional stress to annotators (Zampieri et al., 2019). An automatic system that can identify toxic text and highlight toxic spans can be useful for moderators. It will help save time and prevent stress caused by reading long texts. SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021) draws attention to the problem of identifying toxic spans present in a sentence.
Our proposed system uses a word-level classifier to detect the offensive words present in a sentence. The offsets of the toxic words can then be concatenated to find the toxic spans. We used pre-trained language models (PLMs) to build our classifier. We experimented with BERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020), RoBERTa (Liu et al., 2019b), XLNet (Yang et al., 2020), MPNet (Song et al., 2020), T5 (Raffel et al., 2020), and XLM-RoBERTa (Conneau et al., 2020) to compare their performance on the task of toxic spans detection. With the growing number of pre-trained language models, choosing the correct model is an important decision, as these models contain millions of parameters and are expensive to train. We therefore present a comprehensive analysis of the performance of different models, which can serve as a baseline for future work. Our best performing system was fine-tuned using RoBERTa and attained an F1 score of 0.6841. It was ranked 16th on the official leaderboard. We used different PLMs for fine-tuning and found very small variations in their performance. Further analyzing our model's performance on the test set, we observed that it is essential for the model not only to detect toxic spans but also to decide whether a given sample contains any toxic spans at all. Our code is available online for method replicability.

Background
Identification of toxic/offensive content is an important task in natural language processing. It is essential for the moderation of harmful content on social media sites that might hurt the sentiments of individuals, groups, or communities at large. Much work has been done on the identification of offensive content. OffensEval 2019 and 2020 (Zampieri et al., 2019, 2020) provide a comprehensive analysis of methods useful for the identification of offensive content. SemEval-2020 Task 8: Memotion Analysis (Sharma et al., 2020) presented a dataset of internet memes with one sub-task to detect and quantify offensive content. Brassard-Gourdeau and Khoury (2019) explore different aspects of sentiment detection and their correlation to toxicity. Pavlopoulos et al. (2020) cover the effect of context on toxicity. D'Sa et al. (2020) use BERT and FastText for toxicity detection. Kurita et al. (2019) cover several attacks to bypass toxic content filters and methods to make the filters robust to such attacks. Recent state-of-the-art systems (Wiedemann et al., 2020; Wang et al., 2020; Liu et al., 2019a; Nikolov and Radivchev, 2019) performed well in identifying offensive content. Gröndahl et al. (2018) show that although recent systems perform well on given datasets, very slight changes made by adversaries may fool the models. For example, adding words like "love" to offensive tweets may cause a model to rate them as less offensive.
Toxic content identification is useful for moderating online content on platforms with millions of users. Most prior work deals with labeling an entire text as toxic/non-toxic. None of the previous work has tried to identify the spans within a text that make it toxic. SemEval-2021 Task 5: Toxic Spans Detection aims to bring attention to this problem via the task defined as follows: given a dataset D of sentences, the objective is to learn a classification function that can predict the toxic spans T present in a given sentence. The content of the provided dataset D was in English.
Dataset statistics: The dataset for the task consisted of the character offsets of the toxic spans present in each text sample. A span could consist of a single word or a sequence of words. Table 1 shows the count of samples having different numbers of toxic words.
From Table 1 we can infer that samples with one to three toxic words form a major component of the dataset. The proportion of samples with no toxic words was significantly higher in the test set than in the training and development sets. The toxic words with the highest frequency of occurrence in the training set are given in Table 2. We observed that the toxic words included stopwords (the, a, and, of), which are generally not toxic when used independently. These stopwords can exist as part of multiword toxic spans.

Pre-trained Language Models
Natural language processing tasks are data intensive. Training deep neural networks for NLP tasks requires large amounts of training data that might not always be available. To overcome this problem, researchers proposed pre-training large language models that can be fine-tuned on various downstream tasks. Pre-training involves learning general representations of text that capture its syntactic and semantic relations. The main advantage of pre-training is that it can be done on unlabelled text corpora, allowing training on a large amount of textual data. The pre-trained language models can then be used across various downstream tasks by fine-tuning them on task-specific datasets.

Brief overview of used PLMs
BERT: It is a bidirectional language model based on the Transformer architecture (Vaswani et al., 2017). It uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as pre-training objectives.
ELECTRA: It is one of the most recent models and is inspired by generative adversarial networks. It introduces the Replaced Token Detection (RTD) pre-training objective.
RoBERTa: It is a modification of BERT proposed by Facebook. It uses dynamic masking as part of the pre-training objective. NSP was removed, and the model was pre-trained on more data for a longer time.
XLNet: It is a generalized auto-regressive pre-training method that combines the best of both Auto-Regressive (AR) and Auto-Encoding (AE) modeling techniques. It uses a permutation language modeling (PLM) objective for pre-training.
MPNet: It was proposed by Microsoft. It overcomes the pre-train/fine-tune discrepancy in XLNet. It uses both permutation language modeling and MLM to model the dependencies among predicted tokens while also using the full positional information of a sentence.
T5: It was proposed by Google and aims to reframe all NLP tasks into a single text-to-text format where both inputs and outputs are always strings. It uses a masking objective similar to BERT and teacher forcing for pre-training.

Modelling as Token Classification Task
The given dataset provided the spans of toxic content in each sentence. Each sentence could contain multiple toxic spans, and a toxic span could comprise more than one word. We extracted all toxic words using the toxic spans. If a span contained more than one word, it was further processed to extract the individual words. Once we had found all the toxic words, we split the original sentence into words and labeled each word as toxic/non-toxic. Before splitting the original sentence, we removed extra whitespace and newline characters. We also removed any punctuation before or after a word; punctuation within words was not removed. Figure 1 shows an example of the process. The toxic spans are highlighted in red in the original sentence, which we convert into an array of words labeled as toxic/non-toxic.
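The conversion from character offsets to word-level labels can be sketched roughly as follows. This is a minimal illustration rather than the exact preprocessing used: the function name `spans_to_word_labels` is hypothetical, and it assumes that a word counts as toxic if any of its characters falls inside an annotated span.

```python
import string


def spans_to_word_labels(text, toxic_offsets):
    """Split text into words and label each as toxic (1) or non-toxic (0).

    Assumption: a word is toxic if any of its characters lies inside
    an annotated toxic span.
    """
    toxic = set(toxic_offsets)
    words, labels = [], []
    cursor = 0
    # str.split() with no arguments also collapses extra whitespace and newlines.
    for raw in text.split():
        start = text.index(raw, cursor)
        cursor = start + len(raw)
        # Strip punctuation around the word; punctuation inside it is kept.
        lead = len(raw) - len(raw.lstrip(string.punctuation))
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the chunk was punctuation only
        word_start = start + lead
        word_end = word_start + len(word)
        is_toxic = any(i in toxic for i in range(word_start, word_end))
        words.append(word)
        labels.append(int(is_toxic))
    return words, labels


text = "You are a stupid idiot, really."
words, labels = spans_to_word_labels(text, list(range(10, 22)))
print(list(zip(words, labels)))
```

Here the annotated span covers characters 10 to 21 ("stupid idiot"), so only those two words receive the toxic label, and the trailing comma and full stop are dropped from the word list.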
The next step is to prepare the data for fine-tuning the pre-trained language models. PLMs use tokenization to break the original words into sub-words. Different models use different tokenization techniques, such as Byte-Pair Encoding (BPE) (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012), and SentencePiece (Kudo and Richardson, 2018). One advantage of sub-word tokenization is that it helps to reduce the vocabulary size. One challenge it poses for token classification tasks is deciding which sub-word to use for classification. Different models also add special tokens, such as [CLS], [SEP], and start/end tokens, which are not required for the token classification task. In our approach, we used the first sub-word of each tokenized word for classification. We masked the remaining sub-words and the special tokens while computing the loss. The sub-words were masked only during loss computation and not while being passed through the model, so all sub-words still took part in modeling the dependencies within the sentence. Figure 2 shows the tokenized words and their corresponding labels using the BERT tokenizer.
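The first-sub-word labeling scheme can be illustrated with a small sketch. The tokenizer below (`toy_subword_tokenize`) is a hypothetical stand-in that cuts words into fixed-size pieces; it is not BPE, WordPiece, or SentencePiece, and the `[CLS]`/`[SEP]` tokens are used only because they are the BERT-style example mentioned above. The point is only how labels and the loss mask line up with sub-words.

```python
def toy_subword_tokenize(word, piece_len=4):
    """Hypothetical tokenizer: cut a non-empty word into fixed-size pieces."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, tokenize=toy_subword_tokenize):
    """Expand word-level labels to sub-word level.

    Only the first sub-word of each word carries the word's label and a
    loss-mask value of 1. Remaining sub-words and special tokens get
    mask 0, so they are ignored by the loss but still fed to the model.
    """
    tokens, token_labels, loss_mask = ["[CLS]"], [0], [0]
    for word, label in zip(words, labels):
        for k, piece in enumerate(tokenize(word)):
            first = k == 0
            tokens.append(piece)
            token_labels.append(label if first else 0)
            loss_mask.append(1 if first else 0)
    tokens.append("[SEP]")
    token_labels.append(0)
    loss_mask.append(0)
    return tokens, token_labels, loss_mask


tokens, token_labels, loss_mask = align_labels(["stupid", "idea"], [1, 0])
for row in zip(tokens, token_labels, loss_mask):
    print(row)
```

With a real tokenizer the pieces would differ, but the alignment logic stays the same: one label-bearing position per original word.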

Fine-tuning
We used a simple approach for fine-tuning the model for token classification. We used a token classifier on top of features learned by PLMs. Our classifier consisted of three layers on top of PLM features. First was the batch normalization layer, followed by a dropout layer. The final layer was a time-distributed dense layer over features of each tokenized word containing a single neuron and a sigmoid activation to predict if the given token is toxic/non-toxic.

Masked Loss
As described, we reduced the problem to a token classification task where we predict the label for each word. We used binary cross-entropy loss for the fine-tuning process. In cases where the original word is broken down into multiple sub-words, we used only the first sub-word for calculating the loss. We created masks for each sentence to store the position of words/sub-words. Cross-entropy loss was calculated for required sub-words/words using the masks and then summed up over all tokens in a sentence. The summed value was the loss for a given sentence.
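As a rough sketch of the masked loss described above, the following computes binary cross-entropy only at positions where the mask is 1 and sums the result over the sentence. The function name `masked_bce` and the clipping constant `eps` are assumptions made for illustration; in practice this would be written with tensor operations in the training framework.

```python
import math


def masked_bce(probs, labels, mask, eps=1e-7):
    """Sum of binary cross-entropy over positions where mask == 1.

    probs  : predicted toxicity probability per token (sigmoid output)
    labels : 0/1 target per token
    mask   : 1 for the first sub-word of each word, 0 otherwise
    """
    total = 0.0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue  # masked sub-words and special tokens add no loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Positions 0 and 2 are masked out, so only positions 1 and 3 contribute.
loss = masked_bce(
    probs=[0.9, 0.5, 0.2, 0.1],
    labels=[0, 1, 0, 0],
    mask=[0, 1, 0, 1],
)
print(round(loss, 4))
```

In this example the loss equals -log(0.5) - log(0.9), regardless of what the model predicts at the two masked positions.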

Hyperparameters and Training
Our models were developed in Keras (Chollet et al., 2015) using Hugging Face's implementation of Transformer models (Wolf et al., 2020). We fine-tuned the models on TPUs on Google Colab. We fixed the input sequence length to 150 tokens and padded or truncated each sequence to that length. Our models were fine-tuned using the AdamW optimizer (Loshchilov and Hutter, 2019) with a linear learning rate decay, minimizing the masked binary cross-entropy loss. We experimented with learning rates of 1e-4, 3e-5, 4e-5, and 5e-5 for each PLM architecture. Fine-tuning was done for 4 epochs. For each PLM architecture, the model with the best performance on the development set was used for making the final predictions on the test set.
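Two of the settings above, the fixed sequence length and the linear learning rate decay, can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, the padding id of 0 is an assumption, and the schedule is assumed to decay to zero with no warm-up, which the description does not specify.

```python
def pad_or_truncate(token_ids, max_len=150, pad_id=0):
    """Force a sequence to exactly max_len ids (pad_id is an assumed value)."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate that falls linearly from base_lr to 0 over training.

    Assumption: no warm-up phase and a final learning rate of zero.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


short = pad_or_truncate([101, 2023, 102])
long = pad_or_truncate(list(range(200)))
print(len(short), len(long))

for step in (0, 50, 100):
    print(step, linear_decay_lr(step, total_steps=100))
```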

Predicting Toxic Span Offsets
Our model was trained to find the toxic words. If a word was tokenized into sub-words, we used the first sub-word to determine the toxic nature of the entire word. We stored flag values for each sentence to find the correct label for each word during prediction. Once we found the toxic words, we searched for them in the original unprocessed sentences. The concatenated spans of all predicted toxic words formed the final output.
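A minimal sketch of mapping predicted toxic words back to character offsets is given below. The function name `toxic_words_to_offsets` is hypothetical, and the left-to-right search with a moving cursor is an assumption about how repeated words are disambiguated; whitespace between adjacent toxic words is not added to the output here.

```python
def toxic_words_to_offsets(text, words, predictions):
    """Return character offsets of the words predicted as toxic.

    text        : original, unprocessed sentence
    words       : words in sentence order (punctuation already stripped)
    predictions : 0/1 prediction per word
    """
    offsets = []
    cursor = 0
    for word, is_toxic in zip(words, predictions):
        start = text.find(word, cursor)
        if start == -1:
            continue  # word not found; skip rather than guess a position
        end = start + len(word)
        cursor = end  # later words are searched only after this one
        if is_toxic:
            offsets.extend(range(start, end))
    return offsets


text = "You are a stupid idiot, really."
words = ["You", "are", "a", "stupid", "idiot", "really"]
predictions = [0, 0, 0, 1, 1, 0]
print(toxic_words_to_offsets(text, words, predictions))
```

Searching from a moving cursor keeps a short word such as "a" from matching inside an earlier word.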

Evaluation Metric
The performance of the model was evaluated using the F1 score as described in Da San Martino et al. (2019). Let system $A_i$ return a set $S^t_{A_i}$ of character offsets found toxic for post $t$, and let $S^t_G$ be the set of character offsets in the ground-truth annotation of $t$. The F1 score of system $A_i$ with respect to the ground truth $G$ for post $t$ is calculated as follows:

$$P^t(A_i, G) = \frac{|S^t_{A_i} \cap S^t_G|}{|S^t_{A_i}|}, \qquad R^t(A_i, G) = \frac{|S^t_{A_i} \cap S^t_G|}{|S^t_G|}$$

$$F_1^t(A_i, G) = \frac{2 \, P^t(A_i, G) \, R^t(A_i, G)}{P^t(A_i, G) + R^t(A_i, G)}$$

where $|\cdot|$ represents the cardinality of the set. If $S^t_G = \emptyset$, i.e., no toxic spans are present in $t$, then $F_1^t(A_i, G) = 1$ if $S^t_{A_i}$ is also empty and $0$ otherwise. $F_1^t(A_i, G)$ was averaged over all posts $t$ present in dataset $D$ to obtain a single score for system $A_i$.

Table 3 shows the performance of our proposed model with different PLMs. A learning rate of 1e-4 was used for ELECTRA, 4e-5 for MPNet, and 5e-5 for the remaining PLMs to obtain these results. RoBERTa had the best performance on the test set, while MPNet had the best performance on the development set. Our best performing model achieved an F1 score of 0.6841 on the test set and was ranked 16th on the official leaderboard. We further analyzed the performance of our models on the test set, comparing samples containing one or more toxic words with samples containing no toxic words. Table 4 shows the results of the analysis. We found that our models performed well on samples with one or more toxic words, and our best performing model had a perfect F1 score on 66.06% of them. Our model was unable to find any toxic words in only 5.97% of the samples containing one or more toxic words.

Table 4: Number of test samples with F1 = 1 and F1 = 0, for samples without and with toxic words.

Model          No. of toxic words = 0      No. of toxic words > 0
               F1 = 1      F1 = 0          F1 = 1      F1 = 0
RoBERTa        24          370             1061        96
BERT           24          370             1050        99
ELECTRA        22          372             1007        76
MPNet          18          376             1063        101
T5             24          370             1041        92
XLNet          32          362             1059        112
XLM-RoBERTa    19          375             1056        104
Our model did not perform well on samples with no toxic words. Only 6.09% of such samples were classified correctly. The dataset statistics show that samples with no toxic words constitute 19.7% of the test set, whereas the training and development sets had only 6.12% and 6.23% such samples, respectively. We also found the 15 most common words that were predicted as toxic in test samples containing no toxic words. These words are given in Table 5 along with their frequency of occurrence.
We can observe that Tables 2 and 5 have words in common. We trained our model using a token classification objective, which tries to capture toxic words. The model cannot identify whether a word is part of a toxic or a non-toxic sentence. Sometimes these words may be part of a sentence intended to convey humor or sarcasm. This may lead the model to incorrectly identify toxic words in samples containing no toxic spans.
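For reference, the per-post F1 from the Evaluation Metric section, which drives the analysis above, can be sketched as follows. The function names are illustrative. The handling of posts without gold spans (score 1 only if the prediction is also empty) follows the shared task's definition, and it explains why samples with no toxic words can only score 0 or 1.

```python
def span_f1(predicted_offsets, gold_offsets):
    """Character-offset F1 for a single post."""
    pred, gold = set(predicted_offsets), set(gold_offsets)
    if not gold:
        # No toxic span annotated: perfect only if nothing was predicted.
        return 1.0 if not pred else 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0  # also covers an empty prediction
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(all_predicted, all_gold):
    """Average the per-post F1 over a dataset."""
    scores = [span_f1(p, g) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores)


print(span_f1([], []))              # nothing annotated, nothing predicted
print(span_f1([3, 4, 5], []))       # a false alarm on a non-toxic post
print(span_f1([1, 2, 3], [2, 3, 4]))
```

A single falsely predicted word on a post without gold spans therefore costs the full point for that post, which is consistent with the low scores reported for such samples.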

Conclusion
In this paper, we describe our approach for SemEval-2021 Task 5: Toxic Spans Detection. We propose a word-level classifier for identifying the toxic words in a sentence. We experimented with different PLMs to provide a comprehensive analysis of their performance in identifying toxic spans. Our system ranked 16th on the leaderboard. Our analysis shows that a word-level classifier performs well on sentences that contain at least one toxic word. However, it cannot reliably identify cases with no toxic spans. In the future, we would like to address this problem by using a classifier that predicts whether the sentence is toxic/non-toxic alongside span detection.