SINAI at SemEval-2021 Task 5: Combining Embeddings in a BiLSTM-CRF model for Toxic Spans Detection

This paper describes the participation of the SINAI team in SemEval-2021 Task 5: Toxic Spans Detection, which consists of identifying the spans that make a text toxic. Although several resources and systems have been developed so far in the context of offensive language, both annotation and tasks have mainly focused on classifying whether a text is offensive or not. However, detecting toxic spans is crucial to identify why a text is toxic and can assist human moderators in locating this type of content on social media. In order to accomplish the task, we follow a deep learning-based approach using a bidirectional variant of a Long Short-Term Memory network along with a stacked Conditional Random Field decoding layer (BiLSTM-CRF). Specifically, we test the performance of the combination of different pre-trained word embeddings for recognizing toxic entities in text. The results show that the combination of word embeddings helps in detecting offensive content. Our team ranks 29th out of 91 participants.


Introduction
The advance of online communication has increased the use of offensive or toxic language on many websites, including social networks such as Instagram, Twitter, or YouTube. This type of prejudiced communication can lead to negative psychological effects among Internet users, causing anxiety, harassment, and even suicide in extreme cases (Hinduja and Patchin, 2010).
Moderation is essential to promote healthy online communication. Therefore, governments, online communities, and social media platforms are continuously taking action to implement laws and policies combating toxic language on the Web. Given the amount of data generated every day on the Web, automatic systems based on Natural Language Processing (NLP) techniques are required to help track this type of comment. In particular, offensive language detection and analysis has become an important area of research in NLP, resulting in several studies that contribute to combating this online phenomenon (Plaza-del Arco et al., 2019; Zampieri et al., 2019a; Ranasinghe et al., 2019). In this paper, we present our proposed system as part of our participation in SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021), which aims to identify entities that refer to toxic language in the text. To accomplish the task, our team focused on detecting specific types of toxic entities in the text using a methodology based on the BiLSTM-CRF model, showing that the combination of different pre-trained language embeddings succeeds in detecting toxic entities.
The rest of the paper is structured as follows. In Section 2 some previous related studies are introduced. In Section 3 we explain the data used in our methods and describe the architecture of our proposed system for the Toxic Spans Detection task. In Section 4 we discuss the analysis and evaluation results for the experiments we performed. Finally, we conclude in Section 5 with remarks and future work.

Related work
Heretofore, several shared tasks have been organized in the NLP field to detect offensiveness on the Web for different languages. For instance, the well-known offensive language task OffensEval has held two editions at the International Workshop on Semantic Evaluation (SemEval) (Zampieri et al., 2019b,a), introducing as the main novelty in the second edition a multilingual dataset comprising 5 languages. The GermEval shared task focused on the identification of offensive language in German tweets and comprised two tasks, a coarse-grained binary classification task and a fine-grained multiclass classification task (Wiegand and Siegel, 2018). For Spanish, as far as we know, the first task on offensive language appeared at the 3rd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval) (Carmona et al., 2018), whose goal was to detect aggressiveness in Mexican Spanish tweets.
As a result, most of the studies and resources in offensive language research have been developed specifically for binary and multi-class classification tasks (Ranasinghe et al., 2019; Plaza-del Arco et al., 2019). However, other tasks such as Named Entity Recognition (NER) play an important role in this research and are essential to identify the entities that make a text toxic. Highlighting these toxic spans can help human moderators to interpret and identify this type of content on the Web more easily, instead of relying on a system that generates an unexplained toxicity score per post. NER aims to identify and classify named entities mentioned in unstructured text into predefined categories. The earliest systems developed for addressing this task did not use training data but relied on handcrafted features, heuristics, and a set of rules (Nadeau and Sekine, 2007; Collins and Singer, 1999; López-Úbeda et al., 2018). However, the cost of manual feature tagging and the poor results obtained led to deep learning-based techniques becoming the most suitable choice to tackle the task, as they discover patterns and learn the features in an end-to-end manner (López-Úbeda et al., 2020). Existing state-of-the-art approaches for sequence labeling have shown that Recurrent Neural Networks (RNNs) are capable of learning useful representations automatically, as they enable the modeling of long-distance dependencies between words in a sentence (Limsopatham and Collier, 2016; Wintaka et al., 2019). Inspired by these studies, we have developed a system based on the BiLSTM-CRF model along with the combination of different types of word embeddings to address the toxic spans detection task in text.

Named Entity Recognition Methodology
To address the toxic spans detection task, we focus on recognizing and extracting specific types of toxic entities in the text. Specifically, we follow the methodology proposed by Huang et al. (2015), implementing a BiLSTM-CRF model for the NER task.

Word Embeddings
As input layer of the BiLSTM-CRF neural network we have combined the following word embeddings:
• Static Word Embedding. We use GloVe embeddings which are static and word-level, i.e. each distinct word gets exactly one pre-computed embedding. This type of embeddings is context-independent (Pennington et al., 2014).
• Contextual Word Embedding. For our experiments, we tested two different contextual pre-trained word embeddings: BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) and XLM (Lample and Conneau, 2019). Unlike the previous ones, they are context-dependent which means they produce word representations that are dynamically informed by the words around them. They are based on the well-known Transformer (Vaswani et al., 2017), an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words (or sub-words). Specifically, we used the xlm-mlm-en-2048 model and the bert-base-cased model provided in HuggingFace (Wolf et al., 2019).
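The difference between the two embedding families, and what combining them means at the token level, can be illustrated with a toy sketch. The lookup table, the stand-in contextual encoder, and the function names below are hypothetical and chosen only for illustration; real GloVe, BERT, and XLM vectors are far larger and are produced by the pre-trained models, not by these functions.

```python
from typing import Callable, List

# Hypothetical 2-dimensional "static" table standing in for GloVe.
STATIC = {"you": [0.1, 0.2], "idiot": [0.9, 0.8]}


def static_embed(tokens: List[str]) -> List[List[float]]:
    """Context-independent: a word always maps to the same vector."""
    return [STATIC.get(tok.lower(), [0.0, 0.0]) for tok in tokens]


def contextual_embed(tokens: List[str]) -> List[List[float]]:
    """Toy stand-in for a contextual encoder: each vector also
    depends on the neighbouring tokens (here, on their lengths)."""
    vectors = []
    for i, tok in enumerate(tokens):
        left = len(tokens[i - 1]) if i > 0 else 0
        right = len(tokens[i + 1]) if i + 1 < len(tokens) else 0
        vectors.append([float(len(tok)), float(left), float(right)])
    return vectors


def stack(tokens: List[str],
          embedders: List[Callable[[List[str]], List[List[float]]]]
          ) -> List[List[float]]:
    """Concatenate, token by token, the vectors of every embedder."""
    stacked: List[List[float]] = [[] for _ in tokens]
    for embed in embedders:
        for row, vec in zip(stacked, embed(tokens)):
            row.extend(vec)
    return stacked


vectors = stack(["You", "idiot"], [static_embed, contextual_embed])
print(len(vectors[0]))  # dimension of the combined vector: 2 + 3
```

In this sketch the static vector for "idiot" is the same in any sentence, whereas the toy contextual vector changes when the surrounding tokens change; the combined representation keeps both kinds of information side by side.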

BiLSTM-CRF architecture
We use the combination of bidirectional LSTM and CRF to identify the toxic spans. The context of each word in the sentence is captured by the BiLSTM and then the predictions on the entities are jointly performed in the CRF layer (Sutton and McCallum, 2006). The architecture of the BiLSTM-CRF model is illustrated in Figure 1. This architecture consists of the following sequence of layers:
• Embedding layer. Each word of the sentence is mapped to a vector of concatenated embeddings. As mentioned above, in our experiments, we use XLM, BERT, and GloVe embeddings.
• BiLSTM layer. A bidirectional LSTM recurrent network takes the embeddings as input. In sequence tagging tasks, for a specific time step, this layer enables the hidden states to capture both past and future context information before a token is labeled.
• CRF layer. It makes efficient use of past and future tags to predict the current tag.
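How a CRF layer uses neighbouring tags when decoding can be sketched with Viterbi decoding over the three tags used in this work. This is a minimal sketch, not the Flair implementation: the emission scores stand in for the BiLSTM outputs, and the transition scores are hypothetical values (a trained CRF learns them), set here only so that I-Toxic cannot follow O or start a sequence.

```python
from typing import Dict, List, Optional, Tuple

TAGS = ["O", "B-Toxic", "I-Toxic"]
FORBIDDEN = -1e4  # hypothetical penalty for an invalid tag transition

# Hypothetical transition scores; only O -> I-Toxic is penalised.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    (prev, cur): 0.0 for prev in TAGS for cur in TAGS
}
TRANSITIONS[("O", "I-Toxic")] = FORBIDDEN


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence given per-token scores."""
    if not emissions:
        return []
    # best[t][tag] = (score of the best path ending in tag, previous tag)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [{
        tag: (emissions[0][tag] + (FORBIDDEN if tag == "I-Toxic" else 0.0),
              None)
        for tag in TAGS
    }]
    for t in range(1, len(emissions)):
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for tag in TAGS:
            prev = max(
                TAGS,
                key=lambda p: best[t - 1][p][0] + TRANSITIONS[(p, tag)],
            )
            score = (best[t - 1][prev][0] + TRANSITIONS[(prev, tag)]
                     + emissions[t][tag])
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag to recover the full path.
    tag = max(TAGS, key=lambda g: best[-1][g][0])
    path = [tag]
    for t in range(len(emissions) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


# Token-wise argmax would give O, I-Toxic, I-Toxic (an invalid sequence).
emissions = [
    {"O": 2.0, "B-Toxic": 0.1, "I-Toxic": 0.0},
    {"O": 0.2, "B-Toxic": 1.0, "I-Toxic": 1.2},
    {"O": 0.1, "B-Toxic": 0.3, "I-Toxic": 1.5},
]
print(viterbi(emissions))
```

Labeling each token independently would pick I-Toxic for the second token, whereas decoding the sequence jointly yields a span that starts with B-Toxic. This is the benefit of placing a CRF on top of the BiLSTM outputs.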

Dataset preprocessing
We use the English dataset provided by the organizers of SemEval-2021 Task 5: Toxic Spans Detection. The dataset is split into three different subsets: train, trial, and test, consisting of 7,939, 690, and 2,000 instances, respectively. Each instance in the dataset comprises two fields, the text and a list of toxic spans. A toxic span is defined as a sequence of characters in words that contribute to the text's toxicity. If the text does not contain toxic spans, the span list is empty. An example of two instances in the dataset is provided in Table 1. In the first example, the word "crap" is labeled as toxic in the text, with character offsets from 15 to 18. The second example includes the toxic span "idiot", with character offsets from 4 to 8.

Text | Spans
What a load of crap. | [15, 16, 17, 18]
You idiot. The media went to war against truth. | [4, 5, 6, 7, 8]

To perform our experiments, we preprocess the subsets of the dataset in the following way. First, we use the nltk.tokenize package to tokenize the text. Then, we generate the following features for each text in the subset: the word, the position of the beginning and end of the word in the text, and the NER tag. To perform the NER tagging, we follow the BIO annotation scheme for labeling multi-token named entities (Ratinov and Roth, 2009), in which each token is labeled as the beginning of a span (B-Toxic), inside a span (I-Toxic), or outside any span (O). This scheme is the most popular in the NER task. Figure 2 shows an example of the features generated for the following example in the training set: "How fucking stupid are you?", spans: [4,5,6,7,8,9,10,12,13,14,15,16,17].
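The preprocessing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: a simple regular expression stands in for nltk.tokenize, end positions are taken as inclusive, a token is treated as toxic if any of its characters is annotated, and consecutive toxic tokens are merged into one span. The function name bio_tag is hypothetical.

```python
import re
from typing import List, Tuple


def bio_tag(text: str, spans: List[int]) -> List[Tuple[str, int, int, str]]:
    """Return (word, start, end, tag) for each token of the text."""
    toxic = set(spans)
    rows: List[Tuple[str, int, int, str]] = []
    prev_toxic = False
    # Assumption: words and single punctuation marks are the tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.start(), match.end() - 1  # inclusive end
        # Assumption: any annotated character makes the token toxic.
        is_toxic = any(i in toxic for i in range(start, end + 1))
        if not is_toxic:
            tag = "O"
        elif prev_toxic:
            tag = "I-Toxic"  # continues the span opened by the previous token
        else:
            tag = "B-Toxic"  # first token of a new span
        rows.append((match.group(), start, end, tag))
        prev_toxic = is_toxic
    return rows


example = bio_tag(
    "How fucking stupid are you?",
    [4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
)
for row in example:
    print(row)
```

Under these assumptions, the two annotated words in the example receive B-Toxic and I-Toxic, and every other token, including the question mark, receives O.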

Experiments
During the pre-evaluation period, we trained our models on the train set and evaluated our different approaches on the trial set. During the evaluation period, we trained our models on the train and trial sets and tested the model on the test set. The Flair framework (Akbik et al., 2019), which builds directly on PyTorch, was used to design the BiLSTM-CRF network. We used the default hyperparameter setting in Flair with the following configuration: a learning rate of 0.1, a batch size of 32, a dropout probability of 0.01, and a maximum of 300 epochs. All experiments (training and evaluation) were performed on a node equipped with two Intel Xeon Silver 4208 CPUs at 2.10GHz as main processors, 192GB of RAM, and six NVIDIA GeForce RTX 2080Ti GPUs (with 11GB each).
Our team (SINAI) submitted 4 runs for the Toxic Spans Detection task and each run evaluates the word embeddings as an input to the BiLSTM-CRF network, as explained in Section 3.

Results
In this section, we present the results obtained by our proposed system. In order to evaluate them, we use the official competition metric F1-score.
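The metric can be sketched as follows, assuming the character-offset formulation described for the task (Pavlopoulos et al., 2021): F1 is computed per post over the sets of predicted and gold toxic character offsets and then averaged over all posts. The function names and the handling of empty span lists reflect an illustrative reading of that definition, not the organizers' evaluation script.

```python
from typing import Iterable, List


def span_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between two sets of toxic character offsets for one post."""
    pred, true = set(predicted), set(gold)
    if not pred and not true:
        return 1.0  # assumption: nothing to find and nothing predicted
    if not pred or not true:
        return 0.0
    overlap = len(pred & true)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


def mean_f1(all_predicted: List[List[int]], all_gold: List[List[int]]) -> float:
    """Average the per-post F1 over the whole evaluation set."""
    scores = [span_f1(p, g) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores)


# One extra predicted character: precision 4/5, recall 1.
print(span_f1([15, 16, 17, 18, 19], [15, 16, 17, 18]))
```

Because the score is computed over character offsets, a prediction that covers the right word but includes an extra character, or misses one, is penalised only partially rather than counted as a complete miss.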
The results of our participation in the Toxic Spans Detection task during the evaluation phase are shown in Table 2. In particular, we list the performance of the four runs submitted using the BiLSTM-CRF model along with the combination of different word embeddings. If we analyze the results of the first 3 runs (each embedding used independently), we notice that they differ only slightly, with the best result achieved by the contextual embedding XLM. However, training the model on the combination of static and contextual embeddings (GloVe, BERT, and XLM) leads to enhanced performance with a 0.6727 F1-score. Therefore, our results show the success of the combination of embeddings we chose to solve the task of toxic spans detection in comments using the proposed model.
Table 3 shows the official rank in the competition. As we can see, we are ranked 29th out of 91 participating teams, obtaining an F1-score of 0.6727 with our system. The best result was obtained by the team HITSZ-HLT with an F1-score of 0.7083, which is 0.0356 (3.56 percentage points) above our result. In general, the scores obtained for the task are low, which shows that Toxic Spans Detection remains a challenge for the NLP community and that further research is needed to advance on this specific task. We also observe that the number of participants in this task is high (91), which shows the interest of the NLP community in contributing to addressing this challenge.

Conclusions and Future Work
This paper presents the participation of the SINAI research group in Task 5: Toxic Spans Detection at SemEval 2021.
In this paper, we use a deep learning-based approach for NER to identify spans that make a text toxic, which focuses on the use of a BiLSTM-CRF neural network where different word embeddings are tested. The model is trained on the dataset provided by the organizers of the task (Pavlopoulos et al., 2021) and preprocessing techniques are carried out to tokenize and tag the dataset using the BIO scheme.
Our results show that the sophisticated BiLSTM-CRF architecture, which has been successfully used for other tasks such as biomedical entity recognition or part-of-speech tagging, also achieves remarkable results when addressing tasks related to the identification of offensive language in comments. Moreover, we find that feeding this architecture with our proposed combination of embeddings for word representation provides useful information for the learning phase of the neural network, achieving better results than training the network with a single type of word embedding.
For future work, we plan to study the performance of our proposed method using a variety of linguistic features, including emotions that are inextricably linked to offensive language.