NLRG at SemEval-2021 Task 5: Toxic Spans Detection Leveraging BERT-based Token Classification and Span Prediction Techniques

Toxicity detection in text has been a popular NLP task in recent years. SemEval-2021 Task 5, Toxic Spans Detection, focuses on detecting toxic spans within English passages. Most state-of-the-art span detection approaches employ various techniques, each of which can be broadly classified into Token Classification or Span Prediction approaches. In this paper, we explore simple versions of both approaches and their performance on the task. Specifically, we use BERT-based models - BERT, RoBERTa, and SpanBERT - for both approaches. We also combine these approaches and modify them to bring improvements for Toxic Spans prediction. To this end, we investigate results on four hybrid approaches - Multi-Span, Span+Token, LSTM-CRF, and a combination of predicted offsets using union/intersection. Additionally, we perform a thorough ablative analysis of our observed results. Our best submission - a combination of SpanBERT Span Predictor and RoBERTa Token Classifier predictions - achieves an F1 score of 0.6753 on the test set. Our best post-eval F1 score is 0.6895, obtained by intersecting the predicted offsets from the top-3 RoBERTa Token Classification checkpoints. These approaches improve performance by 3% on average over the shared baseline models - RNNSL and SpaCy NER.


Introduction
Offensive language can include various categories such as threats, vilification, insults, calumniation, discrimination and swearing (Pavlopoulos et al., 2019). Detection of such language is necessary for ease of moderation of content on social media. Despite their popularity, toxicity detection tasks have focused mainly on sequence classification rather than sequence tagging. Finding which spans make a comment or document toxic is crucial in explaining the reasons behind its toxicity. Additionally, such attributions would allow for more efficient semi-automated quality-based moderation of content, especially for verbose documents, in comparison to quantitative toxicity scores.
In SemEval-2021 Task 5, Pavlopoulos et al. (2021) provide a dataset of 10k English texts filtered from the Civil Comments (Borkan et al., 2019) dataset. Each text is crowd-annotated with the character offsets that make the text toxic. The task is to predict these character offsets given the text. The work presented in this paper aims to provide a comprehensive analysis of simple Token Classification (TC) and Span Prediction (SP) methods across multiple BERT-based models - BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and SpanBERT (Joshi et al., 2020). Additionally, we experiment with a few hybrid approaches - Multi-Span (MSP), where the model is trained on multiple spans simultaneously; Span+Token (SP-TC), where the model is trained on both kinds of tasks simultaneously; LSTM-CRF (LC), which uses an LSTM and CRF layer on top of BERT-based models; and a combination of predicted offsets for the above techniques using union/intersection. In Section 2, we perform a compendious literature survey. Section 3 elucidates our approach, including the modelling aspect, the various variants of the base model, and the different Hybrid Systems. In Section 4, we describe our experimental setup and the hyperparameters used for our methods. Lastly, in Section 5 we analyze our results and perform ablative analysis on our systems.

Background
Before the surge in research pertaining to toxic texts, Warner and Hirschberg (2012) modeled hate speech as a word sense disambiguation problem, using an SVM for classification. Subsequent work used RNN language models with character- and token-based methods to classify text. Recently, however, toxic text detection has garnered a lot of attention (Nobata et al., 2016; Park and Fung, 2017; Pavlopoulos et al., 2017; Wulczyn et al., 2017). The increase in offensive language research can partly be credited to various workshops such as Abusive Language Online [1] (Waseem et al., 2017), as well as other fora, such as GermEval for German texts [2], or TRAC (Kumar et al., 2018) and Kaggle challenges [3].
Hanu and Unitary team (2020) introduced Detoxify, a comment detection library modeled using HuggingFace's transformers (Wolf et al., 2020) to identify inappropriate or harmful text online as a result of participation in three such challenges. In a contemporary work, Pavlopoulos et al. (2020) discuss context requirement for toxicity detection.
In SemEval 2020-Task 11 (Da San Martino et al., 2020), the first sub-task - Span Identification - aims at detecting the beginning and end offsets of propaganda spans in news articles. This sub-task is similar to SemEval 2021-Task 5. The proposed approaches for the sub-task can be broadly classified into Span Prediction or Token Classification. Most teams use multi-granular transformer-based systems for token classification/sequence tagging (Khosla et al., 2020; Morio et al., 2020; Patil et al., 2020). Inspired by Souza et al. (2019), Jurkiewicz et al. (2020) use RoBERTa-CRF based systems. Li and Xiao (2020) use a variant of a SpanBERT span prediction system.

Baseline Models
From the models already provided with the dataset, we use RNNSL and SpaCy NER Tagging baselines for token-wise classification.
The RNNSL model combines a single Bi-LSTM layer with a randomly initialized embedding layer. It performs a three-label classification task for each word in the sentence, using the labels: special token, non-toxic word, and toxic word. For each word predicted as toxic, the corresponding offsets are added to the predicted spans. A word containing any toxic offset is marked as toxic during training. The SpaCy NER Tagging model is an NER classifier built on SpaCy language models. It is used to predict the entities labelled as TOXIC in the text, using the spans provided.
[1] https://sites.google.com/site/abusivelanguageworkshop2017/
[2] https://projects.fzai.h-da.de/iggsa/
[3] Jigsaw Toxic Comment Classification Challenge
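The word-level labeling used by RNNSL can be sketched as follows; the function name is illustrative, and whitespace splitting is assumed as the word segmentation:

```python
def word_labels_from_offsets(text, toxic_offsets):
    """Label each whitespace-split word as toxic (1) or non-toxic (0).

    A word is marked toxic if any of its character offsets appears in
    the annotated toxic offsets, mirroring the RNNSL training setup.
    """
    toxic = set(toxic_offsets)
    labels, pos = [], 0
    for word in text.split(" "):
        span = range(pos, pos + len(word))
        labels.append(1 if any(i in toxic for i in span) else 0)
        pos += len(word) + 1  # account for the single space separator
    return labels
```

At prediction time, the mapping is reversed: every word labeled toxic contributes all of its character offsets to the predicted spans.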

BERT-based Token Classification Models
These models comprise a BERT-based model and a classification layer over each final token embedding which predicts whether a token is toxic or not. Based on these classifications, we add the offsets for those tokens (not words) which are marked as toxic by the model. Figure 1a represents a Token Classification Model.

BERT-based Span Prediction Models
We use BERT-based Span Prediction (Figure 1c) models based on Extractive Question Answering systems, similar to work on SQuAD (Rajpurkar et al., 2016) and MRQA (Fisch et al., 2019). In these systems, the output at each token is a start logit and an end logit denoting whether that token is a start or end token of the span, depending on the softmax value. Since a Toxic Spans text can have multiple toxic spans, we take the different contiguous spans from the given offsets and make several 'samples' out of each example. Each span becomes an 'answer' for the particular text sample. We use the word 'offense' as a dummy question. Thus, each contiguous span leads to one 'sample' for every example (Table 1). We store the start index of the answer in the text, similar to the SQuAD (Rajpurkar et al., 2016) dataset, and process the data to provide start and end token positions during training. The classifier layer on top of the encoder embeddings performs a binary classification task for start and end positions. A span is scored using the sum of the predicted start and end logits. From the top-K start and end logits, valid predicted answer spans 4 are chosen during post-processing. A union of all the corresponding offsets is taken to give the final prediction for the example. A threshold is learned on the span scores using the resulting dev set F1 score on offsets, which is then used for test set prediction. All spans with a score above the threshold are considered toxic spans.
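The per-span sample construction can be sketched as follows (the function name and dictionary keys are illustrative; only the contiguity split and the SQuAD-style answer-start bookkeeping follow the description above):

```python
def make_span_samples(text, toxic_offsets, question="offense"):
    """Split annotated character offsets into contiguous spans and emit
    one QA-style sample per span, with 'offense' as the dummy question."""
    samples, start = [], None
    offsets = sorted(toxic_offsets)
    for i, off in enumerate(offsets):
        if start is None:
            start = off
        # close the current span when the next offset is not contiguous
        if i + 1 == len(offsets) or offsets[i + 1] != off + 1:
            end = off + 1
            samples.append({
                "question": question,
                "context": text,
                "answer_text": text[start:end],
                "answer_start": start,  # SQuAD-style answer start index
            })
            start = None
    return samples
```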

Multi-Spans
In Section 3.2, we allow each context to have multiple single-span answers during training. This is counter-intuitive, as the model is only trained to handle a single span at a time, yet is expected to predict multiple single spans during prediction. Two toxic spans in a text are equally important to predict and thus should not be shown at different times during training. To mitigate this issue, we try an approach which we refer to as the 'Multi-Spans' (MSP) approach. Here, we take all the ground start and end token positions during training, and use Binary Cross Entropy on each of the start/end logits. This essentially treats the task as a multi-label classification problem. Hence, during training, all the ground spans are used in the same iteration with the example, and only one 'sample' per example is generated. Figure 1d depicts a representation of the system. Note that two tokens - dumb and pathetic - are marked as start tokens. Similarly, both ignorant and troll are marked as end tokens.
[4] Valid spans are those which have an end index greater than the start index, and a length less than a maximum span length.
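A minimal sketch of this multi-label objective, using NumPy in place of the actual PyTorch training code (the function name is illustrative):

```python
import numpy as np

def multispan_bce_loss(start_logits, end_logits, start_positions, end_positions):
    """Binary cross-entropy over per-token start/end logits with
    multi-hot targets, treating multi-span extraction as a multi-label
    classification problem (a sketch of the Multi-Spans objective)."""
    def bce(logits, positions):
        logits = np.asarray(logits, dtype=float)
        targets = np.zeros_like(logits)
        targets[positions] = 1.0            # multi-hot: all ground positions
        probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return bce(start_logits, start_positions) + bce(end_logits, end_positions)
```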

LSTM-CRF
A recently popular approach in Named-Entity Recognition tasks is to use Conditional Random Fields (CRF) with BERT-based models. Inspired by the CRF-based approaches (Souza et al., 2019; Jurkiewicz et al., 2020), we use BERT-based models with a single BiLSTM layer and a CRF layer. During training, the CRF loss is used, and during prediction, Viterbi decoding is performed. Though a CRF is generally used for word-level classification, we do not mask the inner and end tokens of a word, as doing so degrades dev set performance for our systems. Hence, all the tokens of a word are considered for classification.
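The Viterbi decoding performed by the CRF layer at prediction time can be sketched in NumPy (in practice the pytorch-crf library handles this; shapes and names here are illustrative):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Viterbi decoding over per-token label scores (emissions, shape
    [n_tokens, n_labels]) and a label-transition score matrix
    ([n_labels, n_labels]). Returns the highest-scoring label sequence."""
    n_tokens, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_tokens, n_labels), dtype=int)
    for t in range(1, n_tokens):
        # score of ending at each label j, coming from each label i
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # backtrack from the best final label
    path = [int(score.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```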

Spans+Token
For this system, we use a combination of the two tasks - Token Classification and single-span Span Prediction. We use two classification layers on the token-wise embeddings - one for start and end prediction, and the other for token classification. Training is done simultaneously on both tasks, and the cross-entropy loss of each classifier is weighted. The overall loss is given as:

L = w_SP * Sum_t [CE(s^_t, s_t) + CE(e^_t, e_t)] + w_TC * Sum_t CE(p^_t, p_t)

where s_t, e_t, and p_t are labels for the start, end and token classifiers for token t, while s^_t, e^_t and p^_t are the corresponding predictions. The weights are chosen to equally scale both SP and TC task losses. During prediction, we consider the top-K start and end scores. For each valid span, the score is calculated as the average of the start and end logits, plus the mean of the toxicity logits over the span under consideration:

score(i_s, i_e) = (s^_{i_s} + e^_{i_e}) / 2 + (1 / (i_e - i_s + 1)) * Sum_{k=i_s}^{i_e} t^_k

where i_s and i_e are the start and end indices, s^_{i_s} and e^_{i_e} are the start and end logits at those indices, and t^_k is the toxicity logit at index k. A threshold, similar to Section 3.2, is tuned on the dev set. The offsets taken from predicted spans above the threshold are considered toxic.
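The scoring rule can be sketched as below (the function name is illustrative; logits are taken as plain lists indexed by token position):

```python
def span_token_score(start_logits, end_logits, toxicity_logits, i_s, i_e):
    """Score a candidate span [i_s, i_e] as the average of its start and
    end logits plus the mean toxicity logit over the span, a sketch of
    the Span+Token scoring rule described above."""
    span_toxicity = toxicity_logits[i_s:i_e + 1]
    return (start_logits[i_s] + end_logits[i_e]) / 2.0 + \
           sum(span_toxicity) / len(span_toxicity)
```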

Combination of Offset Predictions
Chen et al. (2017) proposed using the predictions from top few checkpoints and averaging the results to achieve better classification scores. Based on a similar line of thought, we also combine the predicted spans for various checkpoints of a model, as well as across different models using union or intersection.
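Since each prediction is a set of character offsets, the combination is a plain set operation; a minimal sketch (the function name is illustrative):

```python
def combine_offsets(predictions, mode="union"):
    """Combine predicted character offsets for one example across
    checkpoints or models by set union or intersection."""
    sets = [set(p) for p in predictions]
    if mode == "union":
        combined = set.union(*sets)
    elif mode == "intersection":
        combined = set.intersection(*sets)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(combined)
```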

Hardware Requirements
The training and evaluation of the systems were performed on Google Colab's free GPU (NVIDIA K80/P100). The training time varies with the model; for each model, it is around 4-6 hours, well within the 12 hour limit of Colab.

Models & Hyperparameters
For RNNSL, a Keras-based BiLSTM model is provided. We use a max length of 192, a batch size of 32 and a dropout of 0.1. Training is done using the Adam optimizer with early stopping (patience = 3), which in our case halts at 5 epochs. The embedding/hidden state size used is 200. A threshold on the predicted toxic word probability is used to classify a word as toxic; this threshold is tuned on the trial dataset. For SpaCy, the en_core_web_sm model is used with 30 iterations. For all BERT-based models, we use HuggingFace's transformers (Wolf et al., 2020) in PyTorch. For CRF, we use the pytorch-crf (Kurniawan, 2018) library. We use a batch size of 4, train for 3 epochs, use linear learning rate decay, and an AdamW optimizer with a weight decay of 0.01. The initial learning rate is 2e-5. During tokenization, the maximum length allowed is 384, with the exception of RoBERTa Span+Token where it is 512. We use LARGE models for all three - BERT, RoBERTa and SpanBERT - unless otherwise specified.
For Token Classification, we add a label for the [CLS] token if the percentage of toxic offsets in text is greater than 30% in order to provide a proxy text classification objective for the system. For span-based models, the K used for top-K start and top-K end logit selection is 20, and the maximum allowed answer length is 30 tokens. For LSTM-CRF systems, a dummy label is used for the [CLS] token, while the prediction mask for other special tokens is set to 0. A dropout of 0.2 is used. For Span Prediction systems, the overlapping stride is set to 128.
The training dataset used is the tsd_train.csv file and the dev set used is the tsd_trial.csv file, unless otherwise specified. For all systems, we evaluate the F1 scores using the provided script on the checkpoints which give the lowest dev set loss.
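The task metric scores each example by F1 over predicted and gold character offsets; a minimal re-implementation, assuming the standard convention that an empty gold set scores 1.0 only against an empty prediction:

```python
def char_offset_f1(predicted, gold):
    """Per-example F1 over character offsets, as used for evaluation."""
    pred, gold = set(predicted), set(gold)
    if not gold:
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    tp = len(pred & gold)                 # correctly predicted offsets
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The system score is the mean of this quantity over all examples in the evaluation set.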
In Table 2, we report scores for our approaches, evaluated after the evaluation phase using the hyperparameters mentioned in Section 4.2. We observe that the highest score is obtained by SBT-TC (0.6856). The baseline scores (RNNSL/SpaCy) are good (≈0.65) considering that these models are not pre-trained. Notably, SP systems perform worse than their TC counterparts. A likely reason is the self-attention used in BERT-based models: since the interaction is between tokens, and not spans, each token is well represented, while less consideration is given to the span representation around a single token. The reason why SBT-TC performs best out of all the LARGE models could be the random-span Masked Language Modeling objective used in SpanBERT pre-training. By combining predicted offsets across checkpoints (Table 3), we get our best scoring system - RBTa-TC(3,∩) - which achieves a score of 0.6895. However, our best official submission was a variant of the third best combination - RBTa-TC(3,∪)∩SBT-SP (0.6765). It is also observed that intersection approaches perform better than the corresponding union and single-checkpoint approaches, while union approaches perform worse than single checkpoints. This means that the individual checkpoints predict some extra offsets to be toxic. In Table 4, we present results on TBT and TRBTa for TC and SP approaches. These are BASE models fine-tuned on the Civil Comments dataset. Since the Toxic Spans dataset contains similar text data, we expect these models to perform better than BASE models. We observe that TBT-TC and TRBTa-SP perform slightly better than BT-TC and RBTa-SP, despite being BASE models. Also, BT-SP and RBTa-TC are only slightly better than their 'Toxic' counterparts. Yet, in comparison, BASE models - BT-B and RBTa-B - without any multi-stage pre-training perform better than their 'Toxic' counterparts, and are comparable to, if not better than, their LARGE counterparts. This suggests that there is not enough data for the LARGE models, and hence they tend to overfit.
However, the reasons behind the worse performance of 'Toxic' systems are unclear. We also evaluate scores for a few systems on the test set after 3 epochs of training on both train and trial data (-TT). We observe that the performance on both train and trial datasets increases significantly (≈7-10%), showing that these datasets have similar distributions. However, the performance on the test set decreases for RBTa-TC-TT and RNNSL-TT in comparison to Table 2, which shows that the test set distribution might be slightly different for the TC task. For SBT-SP-TT, we see a slight increase, showing scope for improvement of SP systems with more data. Lastly, we evaluate the token-based predictions and span-based predictions for SBT SP-TC separately. Surprisingly, token predictions alone achieve an F1 score of 0.6522 on the test set, which is much better than using both tokens and spans (0.5959). For span-based predictions, however, we only achieve an F1 score of 0.1510. This means that the system is focusing heavily on token-based predictions. Hence, we need to re-evaluate our architectural decisions in order to successfully incorporate both tokens and spans together.

Conclusion
Based on our results and analysis, we conclude that Token Classification systems have an edge over Span Prediction methods on this task. BASE models perform better than LARGE models in either approach, which could imply the need for more data to train LARGE models. Our Multi-Span approach performs poorly, but the Span+Token approach shows some promise, and we need to re-evaluate our architectural choices. The reason why ToxicBERT/ToxicRoBERTa perform worse than BASE models is also an avenue for further analysis. Finally, our individual BERT-based models tend to predict extra offsets for the task. While checkpoint ensembling using intersection is a good way to address this issue, we will explore other remedies in future work.

A Official Submissions
During the evaluation period, we performed a 'cleaning' of the data by removing leading/trailing whitespace and punctuation characters in spans. Additionally, we included in spans those partial words which had more than half of their characters inside the span, and discarded the remaining partial words. We considered these versions of tsd_train.csv and tsd_trial.csv to be 'clean train' and 'clean trial', respectively. During the post-eval period, we found potential issues with the cleaning, and thus we use the original files. Additionally, since the distribution of tsd_test.csv is expected to be similar to tsd_train.csv and tsd_trial.csv, the scores are much better for models trained on tsd_train.csv than on 'clean train'. However, some of our official submissions were from systems trained on the 'clean train' data. Keeping that in mind, we report our official scores for our top few approaches in Table 5.
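The whitespace/punctuation part of this cleaning can be sketched as follows (the function name is illustrative, and the partial-word rule is omitted for brevity):

```python
import string

def clean_span(text, offsets):
    """Strip leading/trailing whitespace and punctuation characters from
    one contiguous span of character offsets, a sketch of the 'cleaning'
    step applied during the evaluation period."""
    strip_chars = set(string.whitespace + string.punctuation)
    offsets = sorted(offsets)
    while offsets and text[offsets[0]] in strip_chars:
        offsets.pop(0)   # drop leading whitespace/punctuation
    while offsets and text[offsets[-1]] in strip_chars:
        offsets.pop()    # drop trailing whitespace/punctuation
    return offsets
```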

B Integrated Gradients
We use Integrated Gradients (Sundararajan et al., 2017) from the Captum (Kokhlikyan et al., 2020) library for qualitative analysis of predictions of the SpanBERT-SP and RoBERTa-TC models. We calculate Integrated Gradients of the targets with respect to the embedding layer outputs. The Riemann Right numerical approximation method is used, with n_steps=50. Following Ramnath et al. (2020), we calculate token-wise importance distributions and word-wise distributions for a few examples. We refer the reader to that paper for more details.
For the Token Classification model, the targets are softmax outputs of toxicity logits of those tokens which the model predicts to be toxic, with a score greater than 0.5. For all such toxicity logits as targets, we calculate attributions with respect to the embedding layer outputs for all the tokens, and average them to get token-wise importance scores. For the Span Prediction model, we find start and end indices for all the predicted spans, and calculate respective attributions, add them, and then average them to get token-wise importance scores.
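The underlying attribution computation can be sketched in NumPy on a toy differentiable function (Captum performs this over the model's embedding layer; the function name here is illustrative):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Riemann Right approximation of Integrated Gradients:
    (x - baseline) times the mean gradient of the target along the
    straight line from the baseline to the input. grad_fn returns the
    gradient of the target at a given point."""
    diff = x - baseline
    alphas = np.arange(1, n_steps + 1) / n_steps  # right-endpoint rule
    grads = np.stack([grad_fn(baseline + a * diff) for a in alphas])
    return diff * grads.mean(axis=0)

# For f(x) = sum(x**2) with gradient 2x and a zero baseline, the summed
# attributions should approximately recover f(x) - f(baseline)
# (the completeness axiom of Integrated Gradients).
```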
Text: offense See a shrink you pathetic troll .

Ground Spans: [ 'pathetic troll' ] Predicted Spans: [ 'pathetic troll' ]
We observe in Figure 2a that the Span Prediction model makes the correct prediction. However, on average, the word 'shrink' gets higher importance than 'pathetic troll'. This contrasts with Figure 2b, where the Token Classification model misses out on the space (because it only considers tokens) and focuses more on the words 'pathetic' and 'troll'. However, the word 'shrink' seems to be important in both cases. This means that while Token Classification models perform better, there are cases which these approaches miss. Additionally, some words outside of a span may contribute to the toxicity of that span. We will analyze such words in future work.

C Model Predictions
The predictions of the various systems for one example present in the test set are listed in Table 6. The example provides the following intuition about the data and the systems: • The spaces in between words are, predictably, ignored by the token-based models. Moreover, conjunctions like 'and' are ignored as well. This means that additional post-processing of the data will lead to improvements in the performance of token classification systems.
• Sometimes, random words like 'go' and 'on' are selected to be toxic, which means that these types of prepositions and verbs can be removed by exact matching in the string, unless they form parts of larger spans.
• The best checkpoints of the span-based models tend to predict empty spans for the selected example. However, when using checkpoint ensembling, we see that union models return accurate spans.
• The ground spans are not entirely correct and are ambiguous. For example, it is not clear whether the word 'ignorant' should be considered toxic. The models, based on other examples, predict 'ignorant' to be toxic, but it is not present in the ground spans. This means that finding the toxic spans is not a trivial task for humans, and annotation cannot be performed easily by crowd-workers.
• In some cases, one of the occurrences of the word 'ignorant' is considered to be toxic, while the other is predicted to be benign. The first instance of 'ignorant' does not seem to be as toxic as the second instance and therefore, more analysis needs to be done to determine the 'degree' of toxicity of the spans. This can be a good direction for future research.