LZ1904 at SemEval-2021 Task 5: Bi-LSTM-CRF for Toxic Span Detection using Pretrained Word Embedding

Recurrent Neural Networks (RNNs) have been widely used in Natural Language Processing (NLP) tasks such as text classification, sequence tagging, and machine translation. Long Short-Term Memory (LSTM), a special RNN unit, can retain information from earlier in a sentence and, in its bidirectional form, from later in the sentence as well. In the shared task of detecting the spans that make texts toxic, we first tokenize the text and apply pretrained word embeddings (GloVe) to generate word vectors. We then build the Bidirectional Long Short-Term Memory Conditional Random Field (Bi-LSTM-CRF) model proposed by Baidu Research to predict whether each word in a sentence is toxic. We tune the dropout rate, the number of LSTM units, and the embedding size over 10 epochs, and choose the best epoch by validation recall. Our model achieves an F1 score of 66.99 percent on the test dataset.


Introduction
Detecting toxic words plays a critical role in keeping online discussions on social media healthy. In previous studies, some tasks (Liu et al., 2019; Borkan et al., 2019a) only identify offensive language at the level of the whole sentence or post. Most of them do not detect the specific spans of words that make the sentence or post offensive.
In SemEval-2021 Task 5: Toxic Spans Detection (Pavlopoulos et al., 2021), the data was collected from the Civil Comments dataset (Borkan et al., 2019b). Each post is a string, and a toxic span is marked by the character offsets of its words within that string. The goal of the task is to classify whether each word in a post is toxic and, if so, to return the character indices of that word. The task is evaluated by an F1 score computed on character offsets across all posts. The challenges of this task include:
• The dataset is small, which makes it difficult to train complicated models such as deep neural networks without overfitting.
• We need to predict which words or phrases in a text are toxic (many-to-many) rather than whether the entire sentence is offensive (many-to-one). This restricts both feature engineering and modeling:
- Feature engineering: we cannot delete or add words in the sentence.
- Modeling: models must make a prediction for each word instead of a single sentiment classification for the whole sentence.
• Most of the words and phrases in sentences are not toxic, so the dataset is imbalanced.
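The character-offset F1 mentioned above can be illustrated with a minimal sketch. It assumes that the score is computed per post on sets of character offsets and then averaged over posts, and that a post with no gold span and no predicted span scores 1; both points are assumptions of this sketch, not details stated in the text, and the function names are hypothetical.

```python
from typing import Iterable, List


def span_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between predicted and gold character offsets for a single post."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        # Assumption: nothing to find and nothing predicted counts as a perfect match.
        return 1.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        # Covers the cases where exactly one side is empty or the sets are disjoint.
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def mean_span_f1(predictions: List[List[int]], golds: List[List[int]]) -> float:
    """Average the per-post F1 over all posts (assumed aggregation)."""
    scores = [span_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)
```

Because the score is computed on characters rather than words, a long toxic word that is missed costs more than a short one.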
The models we explore in this task include a word-based Conditional Random Field (CRF), a word-based Bidirectional Long Short-Term Memory (Bi-LSTM) network with and without pretrained word embeddings, and a Bidirectional LSTM-CRF with pretrained word embeddings. We choose Bi-LSTM-CRF as our final submission, since it performs best in our experiments.
The rest of this paper is organized as follows:
• In section 2, we review related work on applications of different models in Sequence Tagging, Named Entity Recognition, and Sentiment Analysis.
• In section 3, we present summary statistics of the data and describe how we build models and evaluate their performance.
• In section 4, we discuss our model results on the validation dataset with key findings.
• In sections 5 and 6, we present our conclusions based on the experiment results and discuss future work.

Related Work
Toxic span detection is a sequence labeling task and is similar to named entity recognition with only two categories. Xing et al. (2010) surveyed various sequence labeling tasks in terms of methodologies and applications. They also reviewed a few extensions of sequence classification, including early classification and semi-supervised learning on sequences. Nguyen and Guo (2007) compared learning algorithms such as Conditional Random Fields (CRF), Support Vector Machines (SVM), and the Perceptron for sequence labeling tasks. Akbik et al. (2018) proposed a new type of embedding, called contextual string embeddings, for sequence labeling tasks. Lauren et al. (2018) compared multiple word embedding methods for sentiment classification and sequence labeling tasks. For named entity recognition (NER), Mansouri et al. (2008) presented a machine-learning-based approach called Fuzzy Support Vector Machine. Yadav and Bethard (2019) summarized recent deep learning models across various shared tasks.

Data Description
The Toxic Spans Detection dataset includes trial and training data. The training data contains 7939 records, and the trial data contains 690 records. The texts and the character indices of the toxic spans are provided separately. To determine whether each word is toxic after tokenization, we split the text on spaces and punctuation and map the toxic character indices to the corresponding words. As a result, each word in a text is marked as toxic (1) or non-toxic (0).
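The mapping from character offsets to word labels can be sketched as follows. The regular expression, the rule that a word is toxic if any of its characters is marked, and the function names are assumptions made for illustration; the exact tokenizer is not specified in the text.

```python
import re
from typing import List, Set, Tuple

# Assumption: a token is a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def label_words(text: str, toxic_offsets: List[int]) -> List[Tuple[str, int, int, int]]:
    """Return (word, start, end, label) tuples, where end is exclusive.

    Assumption: a word is labeled 1 if at least one of its characters is toxic.
    """
    toxic: Set[int] = set(toxic_offsets)
    labeled = []
    for match in TOKEN_PATTERN.finditer(text):
        start, end = match.span()
        is_toxic = any(i in toxic for i in range(start, end))
        labeled.append((match.group(), start, end, int(is_toxic)))
    return labeled


def words_to_offsets(labeled: List[Tuple[str, int, int, int]]) -> List[int]:
    """Inverse mapping: character offsets of every word labeled as toxic."""
    offsets: List[int] = []
    for _, start, end, label in labeled:
        if label == 1:
            offsets.extend(range(start, end))
    return offsets
```

Keeping the start and end offsets next to each word makes it straightforward to convert word-level predictions back into the character indices that the task expects.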
The distribution of text length (in word count) is summarized in Table 1. It shows that the training and trial data have similar percentiles, means, and standard deviations (std).
The counts of toxic and non-toxic words in the texts are summarized in Table 2. They show that the dataset is highly imbalanced: most words are non-toxic.

Methodology
CRF
The Conditional Random Field (CRF) was introduced in 2001 (Lafferty et al., 2001) for sequence prediction. Given an observable sequence X and a label sequence Y, the objective of a CRF is to model the conditional probability P(Y | X). An advantage of the CRF over other sequential models such as the Hidden Markov Model (Rabiner and Juang, 1986) is that it does not rely on strong independence assumptions about the observations.
Before fitting the model, we create binary features for each word that indicate whether it is uppercase, lowercase, title case, or a digit. We also append the same features for the previous and next words. Next, we use the "crfsuite" package to build the CRF model. We choose "lbfgs" as the optimization algorithm and set c1 and c2 to 0.1 and the maximum number of iterations to 100.
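A minimal sketch of the per-word features described above is given below. The feature names, the BOS/EOS boundary flags, and the helper functions are assumptions for illustration. The settings dictionary mirrors the reported values; with the sklearn-crfsuite wrapper (an assumed choice of interface), they could be passed as keyword arguments to `sklearn_crfsuite.CRF`.

```python
from typing import Dict, List


def shape_features(word: str, prefix: str = "") -> Dict[str, bool]:
    """Binary casing and digit indicators for one word."""
    return {
        prefix + "is_upper": word.isupper(),
        prefix + "is_lower": word.islower(),
        prefix + "is_title": word.istitle(),
        prefix + "is_digit": word.isdigit(),
    }


def word_features(sentence: List[str], i: int) -> Dict[str, bool]:
    """Features of word i plus the same features of its neighbors."""
    features = shape_features(sentence[i])
    if i > 0:
        features.update(shape_features(sentence[i - 1], prefix="prev_"))
    else:
        features["BOS"] = True  # assumption: flag the beginning of the sentence
    if i < len(sentence) - 1:
        features.update(shape_features(sentence[i + 1], prefix="next_"))
    else:
        features["EOS"] = True  # assumption: flag the end of the sentence
    return features


def sentence_features(sentence: List[str]) -> List[Dict[str, bool]]:
    return [word_features(sentence, i) for i in range(len(sentence))]


# Training settings reported in the text; the key names follow sklearn-crfsuite.
CRF_PARAMS = {"algorithm": "lbfgs", "c1": 0.1, "c2": 0.1, "max_iterations": 100}
```

Note that these features describe only the shape of each word, not its identity, so this baseline cannot memorize specific toxic words unless the word itself is added as a feature.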

Bi-LSTM
Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is one of the most commonly used recurrent neural networks in natural language processing tasks. An LSTM unit consists of an input gate, an output gate, a forget gate, and a cell. Its gate structure enables the model to retain long-term dependencies and mitigates the vanishing gradient problem.
In the experiments, we use a Bidirectional LSTM (Bi-LSTM) implemented in tensorflow.keras as the second baseline. We set the number of LSTM units to 200, the embedding size to 50, and the maximum sequence length to 240, which is the maximum sentence length in the training dataset; shorter sentences are padded to this length. To reduce overfitting, we set the dropout rate to 0.2. The output layer uses a sigmoid activation function, and the model is trained with the Adam optimizer (Kingma and Ba, 2014).
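The padding step can be sketched in plain Python. Padding on the right and reserving index 0 for padding are assumptions of this sketch; the text states only that shorter sentences are padded to length 240.

```python
from typing import List

MAX_LEN = 240  # maximum sentence length reported for the training data
PAD_ID = 0     # assumption: index 0 is reserved for padding


def pad_sequence(ids: List[int], max_len: int = MAX_LEN, pad_id: int = PAD_ID) -> List[int]:
    """Right-pad a sequence of word indices to max_len, truncating if it is longer."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


def padding_mask(ids: List[int], pad_id: int = PAD_ID) -> List[int]:
    """1 for real words and 0 for padding, so padded positions can be ignored."""
    return [int(i != pad_id) for i in ids]
```

The label sequences need the same padding so that every word index still lines up with its toxic or non-toxic label.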
We train for 10 epochs and record checkpoints during training. Since the tensorflow package does not contain a built-in F1 score, we load the final model parameters from the checkpoint with the highest validation recall.
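Selecting the checkpoint by validation recall amounts to the following small sketch. The dictionary layout imitates a Keras-style training history, and the key name `val_recall` and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def best_epoch(history: Dict[str, List[float]], metric: str = "val_recall") -> int:
    """Return the 1-based epoch with the highest value of `metric`.

    If several epochs tie, the earliest one is returned.
    """
    values = history[metric]
    best_index = max(range(len(values)), key=lambda i: values[i])
    return best_index + 1
```

Selecting on recall alone can favor a checkpoint that over-predicts the toxic class, so it is worth checking precision at the chosen epoch as well.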

Bi-LSTM with pretrained word embedding
To further improve model performance, we use pretrained word embeddings to generate word representations before training the Bi-LSTM. The embeddings we use are glove-twitter-50 and glove-twitter-100, available through gensim (Řehůřek and Sojka, 2010), so we set the embedding size to 50 and 100 respectively. All other hyperparameters are the same as for the Bi-LSTM above.
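Initializing an embedding layer from pretrained vectors usually means building a matrix whose row i holds the vector of the word with index i. The sketch below uses a plain dictionary as a stand-in for the pretrained vectors (in practice they could be loaded with `gensim.downloader.load("glove-twitter-100")`). Lowercasing the lookup key, the random initialization of unknown words, and the function name are assumptions.

```python
import random
from typing import Dict, List


def build_embedding_matrix(
    vocab: Dict[str, int],
    pretrained: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> List[List[float]]:
    """Build a (len(vocab) + 1) x dim matrix; row 0 is kept as zeros for padding.

    `vocab` maps each word to an index starting at 1.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, index in vocab.items():
        vector = pretrained.get(word.lower())  # assumption: lowercase lookup keys
        if vector is None:
            # Assumption: words without a pretrained vector get small random values.
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix[index] = list(vector)
    return matrix
```

The number of columns must match the pretrained vectors, which is why the embedding size follows the chosen GloVe model.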

Bi-LSTM-CRF with pretrained word embedding
Our final model is the Bidirectional LSTM-CRF proposed by Baidu Research (Huang et al., 2015). Compared with the previous Bi-LSTM architecture, we add a CRF output layer to make the final predictions (as shown in Figure 1). Accordingly, we replace the binary cross-entropy loss with the CRF loss. We use glove-twitter-100 as the pretrained embedding layer. All other LSTM hyperparameters remain the same.
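To illustrate what a CRF output layer adds, the sketch below shows Viterbi decoding: per-word scores (as a Bi-LSTM would emit) are combined with label-to-label transition scores, and the best label sequence is chosen jointly rather than word by word. This covers only the decoding side, not the CRF training loss, and the function name and score layout are assumptions of this sketch.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][k]   score of label k at position t (e.g. from the Bi-LSTM)
    transitions[j][k] score of moving from label j to label k
    """
    num_labels = len(transitions)
    scores = list(emissions[0])
    backpointers: List[List[int]] = []
    for emission in emissions[1:]:
        step_scores, step_pointers = [], []
        for k in range(num_labels):
            # Best previous label for ending in label k at this position.
            best_prev = max(range(num_labels), key=lambda j: scores[j] + transitions[j][k])
            step_scores.append(scores[best_prev] + transitions[best_prev][k] + emission[k])
            step_pointers.append(best_prev)
        scores = step_scores
        backpointers.append(step_pointers)
    # Backtrack from the best final label to recover the whole path.
    path = [max(range(num_labels), key=lambda k: scores[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With all transition scores set to zero, the decoder reduces to an independent choice per word; non-zero transitions let neighboring labels influence each other, which is the extra information the CRF layer contributes.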

Experiment Results
We split the data into training and validation sets at a ratio of 9:1. The evaluation results on the validation set are summarized in Table 3. For each model discussed above, we list its precision, recall, and F1 score at the default threshold. Since the dataset is highly imbalanced, we focus only on the evaluation metrics for toxic words. There are several key findings:
• Models with pretrained word embeddings perform better than those without, since they produce higher precision and recall (and thus a higher F1 score).
• The pretrained embeddings perform similarly regardless of embedding size. We do not increase the embedding size further, since doing so would increase training time without boosting performance significantly.
• As a final output layer, the CRF further improves recall for the Bi-LSTM while keeping precision at the same level, and therefore increases the F1 score.
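The word-level metrics for the toxic class can be computed as in the sketch below, which flattens the per-sentence label sequences and counts only the positive (toxic) class. Returning 0 when a denominator is empty is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import List, Tuple


def toxic_class_metrics(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for label 1 (toxic) over all words."""
    true_flat = [label for sentence in y_true for label in sentence]
    pred_flat = [label for sentence in y_pred for label in sentence]
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because non-toxic words dominate the data, accuracy over all words would look high even for a model that never predicts toxicity, which is why only the toxic class is scored.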
Based on the model evaluation table, we choose Bi-LSTM-CRF with the pretrained glove-twitter-100 embedding as our final model. The model achieves an F1 score of 0.6699 in the final submission.
The confusion matrix for test data can be found in Table 5. We first flatten the sequence to a list of

Texts | Predictions | Error Type
Chris Birch is a mean, self-centered, contrary ass. ... always sucks up to Big Oil. | [ass, sucks] | False Positive
I wish this moron would have been shot to death by the US soldier instead of the other way around. | [moron] | False Positive
Lord have Mercy on us, Trump is running amok. | [] | False Negative
Table 4.
In the false positive examples (underlined), the words "ass", "sucks", and "moron" are predicted as toxic although the ground truth marks no toxicity in these sentences. In the false negative examples (in bold), the model fails to identify toxic words such as "amok", "vandals", and "thieves". The errors may come from the following sources:
• Incorrect labels in the ground-truth spans. The model cannot avoid these errors, since they stem from human annotation mistakes.
• The pretrained GloVe word embeddings do not reflect the sentiment of those words. In other words, the embeddings represent these words as neutral rather than positive or negative.
• The model does not capture how a word's position in the sentence relates to its toxicity label. For example, a word may be annotated as toxic when it appears at the beginning of a sentence but not when it appears at the end, which makes toxicity harder for the model to learn.

Conclusion
Detecting toxic words in texts is critical to fostering a healthy environment on social media. Sequence labeling to find specific offensive words is more difficult than sentence-level sentiment classification, since it requires models to locate the positions or indices of words in sentences. The task also places restrictions on feature engineering, because we cannot delete or add words in sentences. Our experiments show that pretrained word embeddings improve model performance compared with randomly initialized embedding weights. This is consistent with the idea of transfer learning, where outputs learned from other resources are reused as inputs for a specific goal. Another finding is the benefit of model stacking: adding a CRF layer after the Bi-LSTM further improves predictive performance. When a single model does not work well on an NLP task, combining different models with pretrained word embeddings can therefore be a good option to explore. However, there are still many false positive examples in the test set, where the model predicts as toxic words that are in fact not toxic.

Future Work
Further improvements can focus on feature engineering and model implementation. For feature engineering, we can apply data augmentation to false negative examples: we first collect the words that are actually toxic but predicted as non-toxic, and then construct new sentences containing those words as additional training samples. This increases the weight of words that the model originally missed, so it may yield better results. Similarly, we can collect false positive examples and apply data augmentation to reduce the false positive rate.
In addition to data augmentation, one can apply word-level text normalization to map different tenses of a word to the same form, even though words cannot be deleted from or added to a sentence.
From the model perspective, we may consider more advanced classifiers with more complex structures. Due to resource limitations, we could not train large neural network models such as deep neural networks (DNN) or transformers. Most of our experiments were run locally or on Google Colab. Training large neural networks is time-consuming and expensive, since it requires substantial computing resources such as multiple GPUs or TPUs.
With more time and resources, we could experiment with more complex models such as BERT (Devlin et al., 2018) and GPT (Radford et al., 2018). We could also try larger LSTM-based architectures such as Bi-LSTM-CNN-CRF for sequence labeling (Ma and Hovy, 2016).