Sonal.kumari at SemEval-2020 Task 12: Social Media Multilingual Offensive Text Identification and Categorization Using Neural Network Models

In this paper, we present our approaches and results for SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval 2020). OffensEval 2020 had three subtasks: A) identifying whether a tweet is offensive (OFF) or non-offensive (NOT) for the Arabic, Danish, English, Greek, and Turkish languages; B) detecting whether an offensive English tweet is targeted (TIN) or untargeted (UNT); and C) categorizing offensive targeted English tweets into three classes, namely individual (IND), group (GRP), or other (OTH). We participate in all the subtasks A, B, and C. In our solution, we first use the pre-trained BERT model for subtasks A, B, and C, and then apply a BiLSTM model with an attention mechanism (Attn-BiLSTM) to the same subtasks. Our results demonstrate that the pre-trained model does not give good results for all languages and is compute- and memory-intensive, whereas the Attn-BiLSTM model is fast and gives good accuracy with fewer resources. The Attn-BiLSTM model gives better accuracy for Arabic and Greek, where the pre-trained model is not able to capture the complete context of these languages due to its smaller per-language vocabulary.


Introduction
The usage of offensive language in social media is very common nowadays. Sometimes it is used for fun and entertainment, but sometimes it is an expression of a user's aggression, hate, and/or offensive behavior. Identifying a user's intention behind offensive content on social media requires domain-specific sentiment analysis at a fine-grained level of granularity. Controlling the offensive behavior in users' posts on social media requires categorizing the hate speech problem into finer notions such as abusive, aggressive, and/or offensive hate speech. Such categorization helps law-enforcement agencies in the surveillance of social media.
Identification of hate, aggression, or offensive speech in user-generated content has recently attracted significant attention in the sentiment analysis research area, as evidenced by recent publications (Waseem et al., 2016; Davidson et al., 2017; Malmasi and Zampieri, 2018; Kumar et al., 2018), workshops such as AWL and TRAC, and competitions such as HatEval 2019 (Basile et al., 2019), HASOC 2019, and OffensEval 2019 (Zampieri et al., 2019).
The exponential rise in the number of social media users and their posts has led to an enormous amount of data generation. Detection of hate, offensive speech, aggression, or abusive content on social media requires algorithms that make decisions based on deep learning models such as LSTM, bidirectional LSTM, pre-trained BERT (Bidirectional Encoder Representations from Transformers), CNN, or stacked CNN, among others.
OffensEval 2020 was introduced as a multilingual offensive language identification task. We participate in all the subtasks A, B, and C. Our approach is based on a pre-trained BERT model (Devlin et al., 2018) and an attention-aware BiLSTM model (Attn-BiLSTM) (Zhou et al., 2016). First, we clean the given tweet data and fine-tune the pre-trained BERT model for all the subtasks, A, B, and C. Next, we apply the BiLSTM model to the same subtasks. We observe that the BiLSTM model's accuracy is better than the BERT model's for Arabic-A and Greek-A, and also that the BERT model is compute-intensive and memory-hungry compared to the BiLSTM model.
The rest of this paper is organized as follows. Section 2 presents related work. Sections 3 and 4 give the methodology and the experimental analysis. Finally, Section 5 concludes the work.

Related Work
Recently, many research works (Schmidt and Wiegand, 2017; Malmasi and Zampieri, 2017; Gamback and Sikdar, 2017; Fortuna and Nunes, 2018; Zhang et al., 2018; Basile et al., 2019; Liu et al., 2019) have addressed artificial-intelligence-based offensive language detection in English text. These works can be classified into the following four categories: 1) Convolutional Neural Network (CNN) based (Gamback and Sikdar, 2017; Seo et al., 2020), 2) Recurrent Neural Network (RNN) based (Seo et al., 2020), 3) combinations of CNN and RNN (Zhang et al., 2018), and 4) other approaches (Malmasi and Zampieri, 2017; Basile et al., 2019; Liu et al., 2019). Liu et al. (2019) fine-tuned a pre-trained BERT model (Devlin et al., 2018) for the binary offensive language detection task and demonstrated outstanding accuracy. Prior work demonstrated that a bidirectional Gated Recurrent Unit network applied to word-level features outperforms alternatives on hate speech detection data with four class labels. Seo et al. (2020) evaluated three CNN-based and five RNN-based deep learning models on 13 review datasets with both word-level and character-level input structures. They demonstrated that a BiLSTM-based model with word-level input performs best across various benchmark review datasets of the English language.
For multilingual abusive language detection, less work has been done; existing work focuses on languages such as Hindi and English (Kumar et al., 2018), Italian (Bosco et al., 2018), Spanish (Carmona et al., 2018), etc. Despite the extensive research on abusive language identification, most of these works focus on the English language, and the remaining languages are still not much explored in this field.
Moreover, the RNN model is a default choice for most text analytics applications as it outperforms the alternatives, but its recurrent structure hinders learning long-term dependencies because of the vanishing or exploding gradient problem. Hochreiter and Schmidhuber (1997) introduced the LSTM to avoid the vanishing gradient problem and learn long-term dependencies in sequence models by inserting gate units. BiLSTM with attention (Zhou et al., 2016) has been proposed for relation classification and has been shown to outperform SVM, CNN, RNN, and LSTM models. From the literature, it is clear that the BERT model and BiLSTM-based models outperform the rest of the AI models for sentiment analysis (Liu et al., 2019; Seo et al., 2020).

Methodology

We choose two model architectures for the classification tasks on the given datasets: the pre-trained BERT model (Devlin et al., 2018) and BiLSTM with attention (Attn-BiLSTM) (Zhou et al., 2016). Besides, we also train a unified model that uses BiLSTM with attention, which is explained in subsection 3.3.

BERT-multi-cased (BERT_MC)
BERT is a pre-trained bidirectional Transformer encoder stack trained on a large plain-text corpus (Devlin et al., 2018). There are two multilingual BERT (Bidirectional Encoder Representations from Transformers) models, each consisting of an embedding layer with 12 encoders: 1) BERT-multi-cased (covers 104 languages) and 2) BERT-multi-uncased (covers 102 languages). We use the BERT-multi-cased model as it fixes normalization issues in several languages and is recommended because it outperforms for most languages. The model is trained on Wikipedia and uses a common vocabulary of size 119,547 shared among the 104 languages.
We adopt the BERT tokenizer for text encoding and feed the resulting encoded vector to the BERT model for training with 3 epochs. BERT supports a maximum sequence length (SEQ_LEN) of up to 512, but Twitter messages are short, so we fixed it to 256 for Arabic-A and Danish-A and 128 for the rest of the subtasks. For Arabic-A, Danish-A, English-A, English-B, English-C, Greek-A, and Turkish-A, the F1-scores obtained using the BERT model are 0.8172, 0.7672, 0.9060, 0.6711, 0.6054, 0.8304, and 0.7481, respectively.

Attention aware BiLSTM Model: Attn-BiLSTM
Second, we apply the BiLSTM with an attention layer on top to capture the semantic context (see Figure 1). The model architecture consists of five components: 1) input layer, 2) embedding layer, 3) BiLSTM layer, 4) attention layer, and 5) output layer on the top. The input layer takes word tokens extracted from the pre-processed tweet messages and feeds them to the embedding layer. The embedding layer maps each token index into a lower-dimensional space. The BiLSTM layer identifies higher-level features. The attention layer creates a weight vector and uses it to merge the word-level features into a sentence-level feature vector. The output layer classifies the sentence-level feature vector into a predefined set of class labels.
Data cleaning layer: We apply the data cleaning discussed at the beginning of Section 3.

Input layer: We apply a word tokenizer to get a list of tokens (words) for the given Twitter message. After that, we assign each token in the corpus a unique index and then pad each token vector to get a fixed-length input vector. Suppose, for a given sentence, the generated fixed-length vector is {x1, x2, …, xT}, where xi represents a unique word index in the given train corpus and T is the length of the input sentence.
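The input layer described above can be sketched as follows. This is a minimal illustration in plain Python; the helper names, the padding index 0, and the toy corpus are our own assumptions, not the paper's code.

```python
# Minimal sketch of the input layer: whitespace tokenization, unique index
# assignment over the train corpus, and padding/truncation to length T.
def build_vocab(sentences):
    vocab = {}
    for sent in sentences:
        for tok in sent.split():
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # reserve index 0 for padding
    return vocab

def encode(sentence, vocab, max_len):
    ids = [vocab.get(tok, 0) for tok in sentence.split()]
    ids = ids[:max_len]                      # truncate long tweets
    return ids + [0] * (max_len - len(ids))  # pad short tweets with 0

corpus = ["you are great", "you are not great"]
vocab = build_vocab(corpus)
vec = encode("you are great", vocab, max_len=6)  # fixed-length vector {x1, ..., xT}
```

A real pipeline would additionally lower-case, strip punctuation, and handle out-of-vocabulary tokens, but the fixed-length indexing idea is the same.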
Embedding layer: We now transform xi into a lower-dimensional space R^E, where E is the size of the embedding layer, using an embedding matrix W. We initialize the weights of the embedding layer randomly; W is a parameter that is learned, and E is a hyper-parameter chosen by the user. The xi is transformed into ei using the following operation: ei = W vi. Here, vi is a one-hot vector of size |V| (the total number of unique tokens in the corpus) with value 1 at index xi and 0 elsewhere. After applying the embedding, the transformed real-valued vector sequence {e1, e2, …, eT} is generated and fed to the next layer, where T is the sentence length. For Arabic, we fixed E = 128, and for the rest of the languages it is fixed to 256.
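The operation ei = W vi with a one-hot vi is equivalent to selecting the xi-th column of W, which is how embedding layers are implemented in practice. A toy sketch (the matrix values and sizes are illustrative, not the randomly initialized weights used in the paper):

```python
# Embedding lookup: e_i = W v_i, with W of shape E x |V| and v_i one-hot.
V, E = 4, 3  # toy vocabulary size |V| and embedding dimension E
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

def one_hot(i, size):
    v = [0] * size
    v[i] = 1
    return v

def embed(W, v):
    # matrix-vector product W v
    return [sum(W[r][c] * v[c] for c in range(len(v))) for r in range(len(W))]

x_i = 2
e_i = embed(W, one_hot(x_i, V))          # e_i = W v_i
lookup = [W[r][x_i] for r in range(E)]   # equivalent fast column lookup
```

The equivalence is why frameworks store embeddings as a lookup table instead of multiplying by one-hot vectors.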
Dropout layer: We apply dropout on the embedding layer to avoid model overfitting. The optimal dropout value was found to be 0.3 across all the subtasks.
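For illustration, a generic (inverted) dropout with rate 0.3 can be sketched as below; this is a standard textbook formulation with toy inputs, not the framework layer the paper presumably uses.

```python
import random

# Inverted dropout: zero each activation with probability `rate` during
# training and scale survivors by 1/(1-rate) so the expected value is unchanged.
def dropout(vec, rate, rng):
    return [0.0 if rng.random() < rate else x / (1.0 - rate) for x in vec]

rng = random.Random(0)  # fixed seed for reproducibility of this sketch
out = dropout([1.0] * 10, rate=0.3, rng=rng)
```

At inference time the layer is simply the identity, which is why the scaling is done during training.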
BiLSTM layer: The traditional LSTM model processes text sequences in temporal order and captures only the past context, whereas BiLSTM captures the past as well as the future context. We apply BiLSTM to exploit the context from both directions. BiLSTM consists of two LSTM layers, a forward LSTM and a backward LSTM, which compute the forward (h→) and backward (h←) representations, respectively. The forward LSTM processes the input vector {e1, e2, …, eT} from e1 to eT, whereas the backward LSTM processes the sequence in the opposite direction, from eT to e1. These two representations are concatenated to get the i-th word representation hi = [h→; h←]. We use 128 forward and 128 backward LSTM units for the Arabic subtask-A and 256 forward and 256 backward LSTM units for the rest of the subtasks.

Attention layer: In offense analysis, not all words contribute equally to understanding the context. The attention mechanism captures the relative influence of the various words in a tweet message by assigning a weight αi to each word representation. Suppose H ∈ R^(d×T) is the matrix of BiLSTM output vectors [h1, h2, …, hT], where d is the dimension of the word vectors and T is the length of the input sentence. The attention weights and the representation r are computed using the following equations:

M = tanh(H), α = softmax(wᵀ M), r = H αᵀ,

where wᵀ is the transpose of a trainable parameter vector w that is fixed across all input sentences. The dimensions of w, α, and r are d, T, and d, respectively. The final sentence representation is computed as:

h* = tanh(r)

Output layer: We feed the h* representation to a fully connected softmax layer, which outputs a probability distribution over all classes.
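The attention computation M = tanh(H), α = softmax(wᵀM), r = Hαᵀ, h* = tanh(r) can be sketched in plain Python as below. The values of H and w are toy numbers; in the model they come from the BiLSTM outputs and a trained parameter vector.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(H, w):
    # H is d x T (one column per word), w is a length-d parameter vector.
    d, T = len(H), len(H[0])
    M = [[math.tanh(H[i][t]) for t in range(T)] for i in range(d)]
    scores = [sum(w[i] * M[i][t] for i in range(d)) for t in range(T)]  # w^T M
    alpha = softmax(scores)                                             # T word weights
    r = [sum(H[i][t] * alpha[t] for t in range(T)) for i in range(d)]   # r = H alpha^T
    h_star = [math.tanh(x) for x in r]                                  # h* = tanh(r)
    return alpha, h_star

H = [[0.1, 0.9, 0.3],   # d = 2 features for T = 3 words
     [0.4, 0.2, 0.8]]
w = [1.0, 0.5]
alpha, h_star = attention(H, w)
```

The weights α sum to 1, so h* is a tanh-squashed convex combination of the word representations, with the most context-relevant word contributing most.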

Unified multilingual BiLSTM Model: UML_BiLSTM
We also apply the BiLSTM model (discussed in subsection 3.2) to the combined subtask-A data in a multilingual environment; we name this model the unified multilingual BiLSTM (UML_BiLSTM). The benefit of the UML_BiLSTM model is twofold: first, having one model reduces the overall model training time, and second, it reduces the manual effort of maintaining multiple models. Figure 2 depicts the UML_BiLSTM model pipeline.

Data Specification and Experimental Analysis
In this section, we present the data description and the experimental analysis for SemEval-2020 Task 12. We first report the submitted accuracy and then the enhanced accuracy. We perform text normalization for the non-English languages, which results in improved accuracy. Also, we filter all digits out of the tweet messages as they do not contribute to offense analysis. We also combine all the language data of subtask-A to train a single unified BiLSTM model (described in subsection 3.3) for the multilingual environment. We report accuracy in terms of the macro-F1-score, which gives equal weight to all classes and is thus a reliable metric in the presence of class imbalance in the data.
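The macro-F1 metric averages the per-class F1-scores with equal weight, so a minority class (e.g., OFF) counts as much as the majority class. A self-contained sketch with illustrative labels (an equivalent of scikit-learn's `f1_score(..., average="macro")`):

```python
# Macro-F1: compute F1 per class, then take the unweighted mean.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["OFF", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)  # mean of the OFF and NOT F1-scores
```

With these toy labels, the OFF class gets F1 = 0.5 and the NOT class gets F1 = 2/3, so the macro-F1 is their unweighted mean, 7/12.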

Data Specification
The datasets, provided by the organizers, contain five different languages for subtask-A and only the English language for subtasks B and C. The datasets are collected from social media. A complete description of the dataset collection and annotation techniques can be found in (Mubarak et al., 2020) for Arabic, (Sigurbergsson and Derczynski, 2020) for Danish, (Pitenis et al., 2020) for Greek, and (Çöltekin, 2020) for Turkish. Table 1 shows the train-test data size and the vocab-size after pre-processing the train-set for each subtask. The large number of data samples for English subtask-A results in a higher vocab-size compared to English subtasks B and C. Vocab-size also depends on the richness of the language and may thus vary across languages even when the sample size is the same, as can be observed from the table.

Experimental Results and Analysis
We divide the train-set into train and development sets for hyper-parameter tuning of the BERT and BiLSTM models. For all the subtasks, we submit the best accuracy achieved on the development set.
In Table 2, we showcase the macro-F1-scores for all the subtasks. We report the F1-score of our submitted results and the name of the applied model. The submitted results use the BERT model for the Greek-A and Turkish-A subtasks, as it gave better accuracy on the development set. We could not submit results for the English-B subtask. The last column shows the accuracy of the first-ranked system in SemEval-2020 Task 12.
We also report the enhanced accuracy results for the BERT and BiLSTM models. The BiLSTM model outperforms on Arabic-A and Greek-A, whereas the fine-tuned BERT model outperforms on the rest of the subtasks (see Table 2). The BERT model uses a common vocabulary of size 119,547 (shared among 104 languages), in which the Arabic and Greek vocabulary sizes are only 4,873 and 1,566, respectively. In the given datasets, however, after removing redundant tokens, the vocab-size is 33,987 for Arabic-A and 23,480 for Greek-A. Our BiLSTM model learns better by using this larger vocabulary, resulting in better accuracy on the Arabic-A and Greek-A subtasks. Moreover, the pre-trained BERT model is resource-hungry and compute-intensive. For Greek subtask-A, our enhanced accuracy shows a significant improvement over the accuracy achieved by the top ranker.
The effect of text normalization varies across languages because the applied normalization techniques differ due to the morphological variation across the five languages. For Arabic-A, we initially used the train-dev-test sets released by the organizers and submitted the results obtained from a model trained on the train-set only. Later, we downloaded the updated dataset, in which the train and development sets were combined by the organizers. The enhanced results are therefore also computed on the updated dataset for a fair comparison with the top-ranker score (assuming the top-ranker scores are computed on the updated dataset). For Arabic-A, this is the major factor behind the drastic accuracy improvement over the submitted accuracy. In Figure 3, we showcase the confusion matrices computed on the test-sets of the respective enhanced models that achieve the higher F1-score.

Conclusion
We used the BERT and BiLSTM models for multilingual offensive language identification and compared the results. Our result analysis demonstrated that our attention-aware BiLSTM model outperforms the BERT model for Arabic and Greek offensive language detection, while BERT performed better for Danish, English, and Turkish offensive language identification, English offense-type categorization, and English offense-target identification. Further, we combined the Arabic, Danish, English, Greek, and Turkish offensive language identification data to train our UML_BiLSTM model. The UML_BiLSTM results were lower by -0.20% to 11.29% compared to the BERT model and by 1.95% to 12.37% compared to the BiLSTM model. However, UML_BiLSTM training was 2.81 times faster than the BERT model and 1.34 times faster than the BiLSTM model. The UML_BiLSTM model can be further improved by introducing a data-balancing mechanism before training. Our submitted F1-scores are 0.4536, 0.6710, 0.8900, 0.8020, 0.7421, and 0.5259 for Arabic subtask-A, Danish subtask-A, English subtask-A, Greek subtask-A, Turkish subtask-A, and English subtask-C, respectively. After submission, we further enhanced the models to achieve better accuracy. We normalized the data for the non-English languages before feeding it to the BiLSTM model and obtained improved results. Especially for the Greek language, our achieved accuracy improved significantly over the OffensEval-2020 top-ranker F1-score: the top ranker's accuracy was 0.8520, and our improved accuracy is 0.8984. This shows that normalization of social media posts is an important step before applying a model. The improved F1-scores are 0.8447, 0.7672, 0.9060, 0.8984, 0.7489, 0.6711, and 0.6054 for Arabic subtask-A, Danish subtask-A, English subtask-A, Greek subtask-A, Turkish subtask-A, English subtask-B, and English subtask-C, respectively.
In the future, we also plan to evaluate our model on a wide range of languages.