Cisco at SemEval-2021 Task 5: What’s Toxic?: Leveraging Transformers for Multiple Toxic Span Extraction from Online Comments

Social network platforms are generally used to share positive, constructive, and insightful content. However, in recent times, people are often exposed to objectionable content such as threats, identity attacks, hate speech, insults, obscene text, offensive remarks, or bullying. Existing work on toxic speech detection focuses on binary classification or on differentiating toxic speech among a small set of categories. This paper describes the system proposed by team Cisco for SemEval-2021 Task 5: Toxic Spans Detection, the first shared task focusing on detecting the spans in English-language text that contribute to its toxicity. We approach this problem primarily in two ways: a sequence tagging approach and a dependency parsing approach. In our sequence tagging approach, we tag each token in a sentence under a particular tagging scheme. Our best-performing architecture in this approach also proved to be our best-performing architecture overall, with an F1 score of 0.6922, placing us 7th on the final evaluation phase leaderboard. We also explore a dependency parsing approach in which we extract spans from the input sentence under the supervision of target span boundaries and rank our spans using a biaffine model. Finally, we provide a detailed analysis of our results and model performance.


Introduction
It only takes one toxic comment to sour an online discussion. The threat of abuse and harassment online leads many people to stop expressing themselves and give up on seeking different opinions. Toxic content is ubiquitous on social media platforms like Twitter, Facebook, and Reddit; its increase is a major cultural threat and has already led to crimes against minorities (Williams et al., 2020). Toxic text in online social media varies depending on the targeted groups (e.g. women, LGBT people, African people, immigrants) or the context (e.g. pro-Trump discussions or the #MeToo movement). Toxic text online has often been broadly classified by researchers into different categories like hate, offense, hostility, aggression, identity attacks, and cyberbullying. Though the use of various terms for equivalent tasks makes them incomparable at times (Fortuna et al., 2020), toxic speech, or toxic spans in this particular task, SemEval-2021 Task 5 (Pavlopoulos et al., 2021), has been considered a super-set of all the above sub-types. While many models have claimed to achieve state-of-the-art results on various datasets, it has been observed that most models fail to generalize (Arango et al., 2019; Gröndahl et al., 2018). The models tend to classify comments as toxic when they reference certain commonly-attacked entities (e.g. gay, black, Muslim, immigrants), even when the comment has no toxic intent (Dixon et al., 2018; Borkan et al., 2019). A large vocabulary of certain trigger terms leads to biased predictions by the models (Sap et al., 2019; Davidson et al., 2017). Thus, it has become increasingly important to determine the parts of the text that contribute to the toxic nature of the sentence, for both automated and semi-automated content moderation on social media platforms, primarily to help human moderators deal with lengthy comments and to provide them attributions for better explainability of the toxic nature of the post.
This in turn would aid in better handling of unintended bias in toxic text classification. SemEval-2021 Task 5: Toxic Spans Detection focuses on exactly this problem of detecting toxic spans in sentences already classified as toxic at the post level.
In this paper, we approach the problem of multiple non-contiguous toxic span extraction from text both as a sequence tagging task and as a standard span extraction task, resembling the generic approach and architecture adopted for the single-span Reading Comprehension (RC) task. For our sequence tagging approach, we predict, for each token, whether it is part of a toxic span. For our second approach, we compute two scores for each token, corresponding to whether that token is the start or the end of a span. In addition, we deploy a biaffine model to score start and end indices, thus adapting the methodology to multiple non-contiguous span extraction.
Recent work shows that transformer-based architectures like BERT (Devlin et al., 2019) perform well on the task of offensive language classification (Liu et al., 2019a; Safaya et al., 2020; Dai et al., 2020). Transformer-based architectures have also produced state-of-the-art performance on sequence tagging tasks like Named Entity Recognition (NER) (Yamada et al., 2020; Devlin et al., 2019), span extraction (Eberts and Ulges, 2019; Joshi et al., 2020) and QA tasks (Devlin et al., 2019; Lan et al., 2020). Multiple span extraction from text has been explored both as a sequence tagging task (Patil et al., 2020; Segal et al., 2019) and as span extraction as in RC tasks (Hu et al., 2019). Very recently, HateXplain (Mathew et al., 2020) proposed a benchmark dataset for explainable hate speech detection using the concept of rationales.
Attempts have also been made to handle identity bias in toxic text classification (Vaidya et al., 2020) and to build toxic text classifiers robust enough that adversaries cannot bypass toxic filters (Kurita et al., 2019).

Methodology
For our sequence tagging approach, we explore two tagging schemes. First, the well-known BIO tagging scheme, where B marks the first token of an output span, I marks the subsequent tokens, and O marks tokens that are not part of any output span. Additionally, we try a simpler IO tagging scheme, where tokens that are part of a span are tagged I and all other tokens are tagged O. Formally, given an input sentence x = (x_1, ..., x_n) of length n, and a tagging scheme with |S| tags (|S| = 3 for BIO and |S| = 2 for IO), the probability distribution over the tags of the i-th token is

p_i = softmax(f(x_i)),

where p ∈ R^(n×|S|) and f is a parameterized function with |S| outputs.
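As an illustration, the two schemes can be sketched as follows; the (start, end) word-index pairs are a hypothetical intermediate format used here for clarity, not the task's character-offset format:

```python
def tag_tokens(tokens, spans, scheme="BIO"):
    """Assign a tag to each token given toxic span boundaries.

    `spans` is a hypothetical list of (start, end) word-index
    pairs (inclusive) marking the toxic spans.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        for i in range(start, end + 1):
            if scheme == "BIO" and i == start:
                tags[i] = "B"  # first token of a span
            else:
                tags[i] = "I"  # inside a span
    return tags

tokens = ["you", "are", "a", "stupid", "idiot"]
print(tag_tokens(tokens, [(3, 4)], scheme="BIO"))  # ['O', 'O', 'O', 'B', 'I']
print(tag_tokens(tokens, [(3, 4)], scheme="IO"))   # ['O', 'O', 'O', 'I', 'I']
```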
Our other approach is based on the standard single-span extraction architecture widely used for RC tasks. With this approach, we extract toxic spans from sentences under the supervision of target span boundaries, but with an added biaffine model for scoring the multiple toxic spans instead of simply taking the top k spans based on start and end probabilities, thus giving our model a global view of the input. The main advantage of this approach is that the extractive search space scales linearly with the sentence length, which is far smaller than that of the sequence tagging method. Given an input sentence x = (x_1, ..., x_n) of length n, we predict a target list T = (t_1, ..., t_m), where m is the number of targets and each target t_i is annotated with its start position s_i, its end position e_i, and the class the span belongs to (only one in our case, toxic).
However, to adapt to the problem of extracting multiple spans from a sentence, instead of taking the top k spans based on start and end probabilities, we apply a biaffine model (Dozat and Manning, 2016) to score all spans under the constraint s_i ≤ e_i. We then rank all spans by score in descending order and select each span as long as it does not clash with a higher-ranked span.
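The ranking-and-selection step can be sketched as a greedy decode over scored candidate spans (the candidate scores below are made-up values for illustration, not actual biaffine outputs):

```python
def decode_spans(scored_spans):
    """Greedily select spans in descending score order, skipping any
    span that overlaps (clashes with) a higher-ranked one.

    `scored_spans` is a list of (start, end, score) with start <= end.
    """
    selected = []
    for start, end, score in sorted(scored_spans, key=lambda s: -s[2]):
        clash = any(not (end < s or start > e) for s, e, _ in selected)
        if not clash:
            selected.append((start, end, score))
    return sorted((s, e) for s, e, _ in selected)

candidates = [(3, 4, 0.9), (3, 3, 0.8), (7, 7, 0.6), (4, 5, 0.5)]
print(decode_spans(candidates))  # [(3, 4), (7, 7)]
```

The overlap check gives the decoder a global view: a high-scoring span blocks all lower-scoring spans that intersect it.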

Dataset
The dataset provided to us by the workshop organizers consisted of a random subset of 10,000 posts from the publicly available Civil Comments dataset, drawn from a set of 30,000 posts originally annotated as toxic (or severely toxic) at the post level, and manually annotated for toxic spans by 3 crowd-raters per post. The final character offsets were obtained by retaining the offsets with a toxicity probability of more than 50%, computed as the fraction of raters who annotated the character offset as toxic. Basic statistics about the dataset can be found in Table 1. Additionally, we provide a quick look into the length-wise distribution of spans across the train, development, and test sets in Table 2. As we observe, the majority of the spans are just a single word in length and mostly comprise the most commonly used cuss words in the English language. In our Results Analysis section, we show why this statistic is important for training and evaluating our systems and for the future development of toxic span extraction datasets.
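The rater-aggregation step can be sketched as follows; the per-rater offset lists are hypothetical examples, not actual dataset annotations:

```python
from collections import Counter

def aggregate_offsets(annotations, num_raters=3, threshold=0.5):
    """Keep character offsets marked toxic by more than `threshold`
    (a fraction) of raters.

    `annotations` is a list of per-rater offset lists, one list per rater.
    """
    counts = Counter(off for rater in annotations for off in set(rater))
    return sorted(off for off, c in counts.items()
                  if c / num_raters > threshold)

# Three raters; offsets 3 and 5 get 2/3 votes, offset 4 gets 3/3, 9 gets 1/3.
raters = [[3, 4, 5], [3, 4], [4, 5, 9]]
print(aggregate_offsets(raters))  # [3, 4, 5]
```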

Evaluation Metric
To evaluate the performance of our systems, let system A return a set S_A^t of character offsets for the parts of post t found to be toxic, and let S_G^t be the character offsets of the ground truth annotations of post t. We calculate the F1 score of S_A^t w.r.t. S_G^t as

P^t(A, G) = |S_A^t ∩ S_G^t| / |S_A^t|,
R^t(A, G) = |S_A^t ∩ S_G^t| / |S_G^t|,
F1^t(A, G) = 2 · P^t(A, G) · R^t(A, G) / (P^t(A, G) + R^t(A, G)),

where |·| denotes set cardinality.
If the predicted set S_A^t is empty for a post t, then we set F1^t(A, G) = 1 if the ground truth S_G^t is also empty; if S_G^t is empty and S_A^t is not, we set F1^t(A, G) = 0.
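Putting these definitions together, the per-post score can be sketched as a minimal implementation of the organizers' character-offset F1:

```python
def char_offset_f1(pred_offsets, gold_offsets):
    """Per-post F1 over sets of toxic character offsets."""
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not pred and not gold:
        return 1.0  # both empty: perfect agreement
    if not pred or not gold:
        return 0.0  # exactly one empty: no agreement
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Prediction covers characters 0-5, gold covers 0-11:
# precision = 1.0, recall = 0.5, F1 ≈ 0.6667
print(char_offset_f1(range(0, 6), range(0, 12)))
```

The system-level score is then the mean of this per-post F1 over all posts.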
System Description

Sequence Tagging Approach

For our sequence tagging approach, we employ the commonly used BiLSTM-CRF architecture (Huang et al., 2015), used predominantly in many sequence tagging problems, but with added contextual word embeddings from a transformer and character-based word embeddings for each word. We experiment with a total of 5 transformer architectures, namely BERT (Devlin et al., 2019), XLNet, RoBERTa (Liu et al., 2019b), ALBERT (Lan et al., 2020) and SpanBERT (Joshi et al., 2020). For all of the above architectures, the large variant was used, except ALBERT, for which we use its xlarge-v2 variant. First, the tokenized word input is passed through the transformer and the outputs of the last 4 encoder layers are concatenated to obtain the final contextualized word embedding E_T for each word in the sentence. Additionally, we pass each character of a word through a character-level BiLSTM network to obtain a character-based word embedding E_C, as used by Lample et al. (2016). Finally, both embeddings E_T and E_C for each word are concatenated and passed through a BiLSTM layer followed by a CRF layer to obtain the most probable tag for each word in the sentence.

Dependency Parsing Approach
For our dependency parsing approach, we employ a biaffine classifier to score our spans post-extraction, following a similar approach to prior work on biaffine span scoring. This methodology fits our purpose of multiple toxic span extraction better than span extraction systems for generic RC tasks, which are capable of extracting just a single span from a sentence (Yang and Ishfaq).

[Figure 2: Sequence Tagger Model]

We used BERT Large for all our experiments and followed the recipe of Kantor and Globerson (2019) to extract contextual embeddings for each token, together with character-based word embeddings. After concatenating the word embeddings and character embeddings for each word, we feed the output to a BiLSTM layer. We then apply two separate FFNNs to the output word representations x to create different representations h_s and h_e for the start and end of the spans. These representations are then passed through a biaffine model to score all possible spans (s_i, e_i), where s_i and e_i are the start and end indices of the span, under the constraint s_i ≤ e_i (the start of the span precedes its end), creating an l × l × c scoring tensor r_m, where l is the length of the sentence and c is the number of NER categories + 1 (for the non-entity class). We compute the score for a span i as

r_m(i) = h_s(s_i)^T U_m h_e(e_i) + W_m (h_s(s_i) ⊕ h_e(e_i)) + b_m

and assign each span the category

y'(i) = argmax r_m(i).

Following this, we rank each span whose category is other than non-entity and include spans in our final prediction as long as they do not clash with higher-ranked spans, with the additional constraint that an entity containing, or contained inside, a higher-ranked entity will not be selected.
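A minimal PyTorch sketch of the biaffine scoring step above; the hidden size, FFNN details, and two-class setup (toxic + non-entity) are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Score all (start, end) index pairs with a biaffine product,
    following the r_m formulation: a bilinear term plus a linear term
    over the concatenated start/end representations.
    """
    def __init__(self, hidden=128, num_classes=2):  # toxic + non-entity
        super().__init__()
        self.start_ffnn = nn.Linear(hidden, hidden)  # produces h_s
        self.end_ffnn = nn.Linear(hidden, hidden)    # produces h_e
        self.U = nn.Parameter(torch.zeros(num_classes, hidden, hidden))
        self.W = nn.Linear(2 * hidden, num_classes, bias=True)

    def forward(self, x):                      # x: [seq_len, hidden]
        h_s = torch.relu(self.start_ffnn(x))   # [l, h]
        h_e = torch.relu(self.end_ffnn(x))     # [l, h]
        # bilinear term h_s^T U h_e for every (start, end, class) triple
        bilinear = torch.einsum("ih,chj,kj->ikc", h_s, self.U, h_e)
        # linear term over concatenated (h_s, h_e), broadcast to all pairs
        l = x.size(0)
        pairs = torch.cat(
            [h_s.unsqueeze(1).expand(l, l, -1),
             h_e.unsqueeze(0).expand(l, l, -1)], dim=-1)
        # [l, l, c] scoring tensor; masking s_i <= e_i is left to the decoder
        return bilinear + self.W(pairs)

scores = BiaffineSpanScorer()(torch.randn(5, 128))
print(scores.shape)  # torch.Size([5, 5, 2])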

Experimental Setup
Data was originally provided to us in the form of sentences and the corresponding character offsets of the toxic spans of each sentence. Before converting the character offsets to the required format for our respective approaches, we apply some basic text pre-processing to all sentences. First, we normalize all sentences by converting all white-space characters to spaces. Second, we split all punctuation characters from both sides of a word and also break abbreviated words. These pre-processing steps help improve the F1 score of both our approaches, as shown in Table 6.

After these pre-processing steps, we formulate our targets for both approaches. For our sequence tagging approach, we tag each word in the sentence with its corresponding tag under the tagging scheme we follow, BIO or IO. For our span extraction approach, we convert the sequence of character offsets into the corresponding word-level start and end indices of each span. In Fig. 4, we provide a pictorial representation of the above data preparation procedures for both approaches.

We use the PyTorch framework for building our deep learning models, along with the transformer implementations, pre-trained models, and tokenizers from the HuggingFace library (http://huggingface.co/). We list the major hyperparameters of our best-performing systems for both approaches in Tables 3 and 4.

We train all our sequence tagging models with stochastic gradient descent in batched mode with a batch size of 8. In the training phase, we keep all layers in our model, including all the transformer layers, trainable. We start training at a learning rate of 0.01 with a minimum threshold of 0.0001, and halve the learning rate after every 4 consecutive epochs with no improvement in the F1 score on the development set.
We train our model for a maximum of 100 epochs, or until 4 consecutive epochs with no improvement at our minimum learning rate.
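The two pre-processing steps described in this section can be sketched as follows (a simplification; our exact tokenization rules may differ):

```python
import re

def preprocess(sentence):
    """Normalize whitespace and split punctuation off words."""
    # 1. Convert all white-space characters (tabs, newlines, etc.) to
    #    single spaces.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    # 2. Split punctuation from both sides of a word; splitting the
    #    apostrophe also breaks abbreviated forms ("you're" -> "you ' re").
    sentence = re.sub(r"([^\w\s])", r" \1 ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

print(preprocess("You're\tan idiot!!"))  # You ' re an idiot ! !
```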
We train our model for the dependency parsing approach with the Adam optimizer in batched mode with a batch size of 32 and a learning rate of 0.0001 for a maximum of 40,000 steps. With this approach too, we keep all layers trainable during the training phase except the BERT transformer layers: pre-trained BERT and fastText embeddings were used only to extract context-dependent and context-independent embeddings, respectively, and BERT was not fine-tuned during training.
Training was performed on 1 NVIDIA Titan X GPU. Our code is available on GitHub.

Results
In Table 5 we present F1 scores for all systems trained under both our sequence tagging and span extraction approaches. For our sequence tagging approach, we break down the results by the transformer architecture and tagging scheme used for each experiment. Our best-performing architecture proved to be the sequence tagging system with the XLNet transformer trained with the IO tagging scheme. Additionally, in Table 6 we show how the LSTM and CRF layers over the transformer architecture, and the pre-processing steps mentioned in our Experimental Setup section, affect the performance of our best-performing architecture.

Model              F1      ∆
Our Model          0.6922  -
 - LSTM            0.6912  0.0010
 - CRF             0.6850  0.0072
 - Pre-processing  0.6759  0.0163

Table 6: Impact of LSTM, CRF and pre-processing on learning


Results Analysis

Length vs Performance

We wanted to understand how the performance of the system varies with the length of spans. Table 7 summarizes the performance of our best-performing systems for all approaches on the test dataset spans, divided into 3 sets according to their length in terms of the number of words that make up the span.

[Table 7: F1 scores by span length]

Learning Context
The majority of single-word spans in the dataset are the most commonly used cuss or abusive words in the English language, i.e., words that can be directly classified as toxic and are not context-dependent, e.g. "stupid", "idiot", etc., with spans longer than a single word having a smaller ratio of such words. We acknowledge that an AI-based system should be able to do much more than just detect common English cuss words in a sentence, which could otherwise be done by a simple dictionary search; in particular, it should learn the context in which a word is used. The deteriorating performance of the model with increasing span length led us to dig deeper into our test set results to find out whether our model is able to detect context-dependent toxic spans in sentences. We follow a two-step procedure to analyze this. First, we calculate our model's performance on single-word spans consisting of just the top 25 most commonly occurring context-independent cuss words. Table 8 shows an analysis of these results. Second, we take the word "black" and analyze two sentences in our test set in which the word was mentioned in a toxic and a non-toxic context. Fig. 5 shows that our model indeed tags the occurrence used in a toxic context as toxic and the other as non-toxic.

[Figure 5: Toxicity classification of the word "black" in toxic and non-toxic contexts]
Span Type               F1
Single Word Cuss Spans  0.6894
Others                  0.1736

Table 8: F1 score of context-independent cuss words


Conclusion

In this paper, we presented our approach to SemEval-2021 Task 5: Toxic Spans Detection. Our best submission achieved an F1 score of 0.6922, placing us 7th on the Evaluation Phase Leaderboard. Future work includes independently incorporating both post-level and sentence-level context for determining the toxicity of a word, and collating a dataset whose toxic spans comprise a healthy mixture of simple cuss words (which can always be attributed as toxic independent of context) and words whose toxicity depends on the context in which they appear, thereby enabling better systems for contextual toxic span detection.