YNU_oxz at SemEval-2020 Task 12: Bidirectional GRU with Capsule for Identifying Multilingual Offensive Language

This article describes the system submitted to SemEval-2020 Task 12 OffensEval 2: Multilingual Offensive Language Recognition in Social Media. The task is to classify offensive language in social media. The shared task covers five languages (English, Greek, Arabic, Danish, and Turkish) and three subtasks. We participated only in English subtask A, which identifies offensive language. To solve this task, we proposed a system based on a Bidirectional Gated Recurrent Unit (Bi-GRU) with a Capsule model, and we used a K-fold approach for ensembling. Our model achieved a macro-averaged F1 score of 0.90969 (ranked 27/85) in subtask A.


Introduction
Offensive language is ubiquitous in social media, and individuals often use the anonymity of computer-mediated communication to engage in anti-social online behavior, including cyberbullying (Xu et al., 2012), malicious provocation (Kwok and Wang, 2013), and offensive language (Cheng et al., 2017). The widespread dissemination of offensive content in social media is a cause of concern for governments and many technology companies around the world. One of the most common and effective strategies for tackling offensive language online is to train systems that can recognize such content.
SemEval-2020 OffensEval 2 is proposed for multilingual offensive language recognition in social media (Zampieri et al., 2020). The shared task contains three subtasks and five languages (English, Greek, Arabic, Danish, and Turkish). Subtask A is a coarse-grained binary classification that aims to identify offensive language: participating systems need to assign each tweet to one of two categories, Offensive (OFF) or Not Offensive (NOT). In this competition, we participated only in English subtask A. We used deep learning to build a bidirectional GRU (Bi-GRU) with Capsule model (Yang et al., 2018). The GRU is simpler and more efficient than the traditional LSTM model (Chung et al., 2014). Our model uses a Bi-GRU (Bahdanau et al., 2014) to process the sequence in both directions, exploiting both the preceding and the following context. A capsule is a group of neurons whose parameters are represented as a vector, and the capsule network uses inner products to cluster the input features.
The rest of this article is organized as follows. Section 2 introduces related work on multilingual offensive language identification. Section 3 describes the models and data. Section 4 presents the experimental results. Finally, we summarize in Section 5.

Related Work
In recent years, offensive language has prevailed on social media, and social media has become the most popular medium among users. According to a survey (QEV Analytics and of America, 2009), 70% of adolescents used social media sites every day, and users shared their opinions through social media such as Twitter, Facebook, Microblog, etc. Kumar et al. (2018) attempted to identify hate speech. On the one hand, users benefit from social media by learning from or interacting with other users; on the other hand, they face offensive online content. For more recent surveys of hate speech and offensive language detection, we recommend Schmidt and Wiegand (2017) and Fortuna and Nunes (2018). Schmidt and Wiegand (2017) investigated features widely used for hate speech detection, including simple surface features, word generalization, knowledge-based features, etc. Fortuna and Nunes (2018) argued that the automatic detection of hate speech and offensive language in text is very important for online social platforms and has unquestionable potential for social impact. Davidson et al. (2017) presented the results of hate speech detection using word n-grams and emotional vocabulary, and provided insights into examples of misclassification.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
In addition to recently published research, a number of related shared tasks have been organized. Among them, GermEval 2018 (Wiegand et al., 2018) addresses offensive language recognition and aims to promote research on offensive content recognition in German-language microblogs. The best team's system trained three base classifiers (maximum entropy and two random forest ensembles) using five disjoint feature sets, and then used a maximum entropy meta-level classifier for the final classification (Montani, 2018). SemEval-2019 hosted the shared tasks HatEval (Basile et al., 2019) and OffensEval (Zampieri et al., 2019b). HatEval is a multilingual task on detecting hate speech against immigrants and women on Twitter. The Fermi team was the best team in HatEval; it proposed an SVM model with an RBF kernel that uses sentence embeddings from Google's Universal Sentence Encoder as features (Indurthi et al., 2019). OffensEval addresses the identification and classification of offensive language in social media. The NULI team was the best-performing team; they used BERT-base with default parameters. HASOC 2019 (Mandl et al., 2019) was proposed to identify hate speech and offensive content in Indo-European languages. Its purpose is to develop robust technologies capable of processing multilingual data and to develop transfer learning methods that can exploit cross-lingual data. The best system is based on an ordered neuron LSTM (ON-LSTM) with an attention model and adopts a K-fold approach for ensembling (Wang et al., 2019).

Data description
We participated only in English subtask A. The official English dataset provided this year differs from the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a). Each instance in the dataset comes with two scores, AVG CONF and CONF STD. AVG CONF is the average of the confidences predicted by several supervised models that a specific instance belongs to the positive class for that subtask; the positive class is OFF for subtask A. CONF STD is the standard deviation of those confidences around AVG CONF for a particular instance. The officially provided English dataset contains scores rather than labels; the scores are confidence measures produced by unsupervised learning methods. We therefore applied a threshold of 0.5 to the average confidence (AVG CONF) to map the scores to the OffensEval labels. Because a larger training set tends to improve model performance, we randomly selected 100,000 instances, the maximum that our experimental equipment could accommodate, as the training set. To match the size of the OffensEval 2019 test set, we randomly selected 3887 instances as the validation set for this experiment.
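The score-to-label mapping and random sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(text, avg_conf)` row layout, the fixed seed, and the choice of a strict `>` comparison at the threshold are all assumptions.

```python
import random

def score_to_label(avg_conf, threshold=0.5):
    """Map an AVG CONF score to an OffensEval label.

    Assumption: scores strictly above the threshold become OFF;
    the paper does not say how a score of exactly 0.5 is handled.
    """
    return "OFF" if avg_conf > threshold else "NOT"

def sample_rows(rows, n, seed=42):
    """Randomly pick n rows without replacement (all rows if fewer than n)."""
    rows = list(rows)
    if len(rows) <= n:
        return rows
    return random.Random(seed).sample(rows, n)

# Hypothetical rows of (tweet text, AVG CONF) for illustration only.
rows = [("tweet a", 0.73), ("tweet b", 0.12), ("tweet c", 0.51), ("tweet d", 0.40)]
train = sample_rows(rows, 3)
labelled = [(text, score_to_label(score)) for text, score in train]
```

In the paper's setting, `n` would be 100,000 for the training set and 3887 for the validation set.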

Data preprocessing
We performed several operations to pre-process the data. Tweets were tokenized using the Tweetokenize tool 1 . We applied emoji substitution and hashtag segmentation, replaced every "@USER" mention with "username", and limited runs of consecutive "@USER" mentions to three to reduce redundancy. We also removed punctuation, converted all uppercase letters to lowercase, restored abbreviations, etc.
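A few of these steps can be sketched with the standard library alone. The sketch below covers only mention limiting and replacement, lowercasing, and punctuation removal; emoji substitution, hashtag segmentation, and abbreviation restoration are omitted, and it does not use Tweetokenize. The function name and the order of operations are assumptions.

```python
import string

def preprocess(tweet, max_mentions=3):
    """Simplified tweet clean-up (an illustrative subset of the steps)."""
    tokens, run = [], 0
    for tok in tweet.split():
        if tok == "@USER":
            run += 1
            if run > max_mentions:
                continue  # drop mentions beyond the allowed run length
            tokens.append("username")
        else:
            run = 0
            tokens.append(tok)
    text = " ".join(tokens).lower()
    # Remove ASCII punctuation, then collapse any leftover whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

cleaned = preprocess("@USER @USER @USER @USER You are GREAT!!!")
```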

Bi-GRU with a Capsule model
Our proposed network architecture is shown in Figure 1. Our model is built on a Bi-GRU with a Capsule layer, where the GRU is a variant of the LSTM. Next, we briefly describe each layer of the system.
• Embedding layer: The embedding layer converts each input word, looked up in an existing dictionary, into a vector using a pre-trained word vector model.
• Encoding layer: The gated recurrent unit (GRU) is an LSTM variant (Chung et al., 2014) that combines the forget gate and input gate of the LSTM into a single update gate. This makes the GRU simpler and more efficient than traditional LSTM models (Wang et al., 2018). In the encoding layer, we used a bidirectional GRU to encode the vectorized text and capture its context. A Bi-GRU is a neural network model consisting of two unidirectional GRUs running in opposite directions, whose output is jointly determined by the states of both. The input of the forward GRU is the output sequence of the previous layer in its original order, and the input of the backward GRU is the same sequence in reverse order. At each time step, the input is fed to both GRUs simultaneously, and the output is decided jointly by the two unidirectional GRUs. The current hidden state of the Bi-GRU is determined by three parts: the current input $x_t$, the forward hidden state $\overrightarrow{h_{t-1}}$, and the backward hidden state $\overleftarrow{h_{t-1}}$ at time $t-1$:
$$\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}})$$
$$\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t-1}})$$
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$
where the $\mathrm{GRU}()$ function represents a non-linear transformation of the input word vector that encodes it into the corresponding GRU hidden state. $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the forward and backward hidden states of the bidirectional GRU at time $t$, and $h_t$ is the vector obtained by concatenating $\overrightarrow{h_t}$ with $\overleftarrow{h_t}$.
• Capsule layer: In a deep learning model, spatial patterns aggregated at lower levels help to represent higher-level concepts. We used the Capsule network to enhance the model's feature extraction capability. Spatially insensitive methods are inevitably limited when handling rich text structure (such as word positions, semantic information, and grammatical structure): they encode it inefficiently and lack expressive power. The Capsule network mitigates this disadvantage by replacing the individual neurons of a traditional neural network with vectors of neurons and training the network with dynamic routing. The Capsule parameter update algorithm is routing-by-agreement (Sabour et al., 2017): a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule. The Capsule computation is as follows:
$$V_j = \frac{\|S_j\|^2}{1 + \|S_j\|^2} \frac{S_j}{\|S_j\|}$$
$$S_j = \sum_i C_{ij} \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i$$
where $V_j$ is the vector output of capsule $j$ and $S_j$ is its total input. The prediction vector $\hat{u}_{j|i}$ is obtained by multiplying the output $u_i$ of a capsule in the layer below by a weight matrix $W_{ij}$, and the $C_{ij}$ are coupling coefficients determined by the iterative dynamic routing process.
• Output layer: This layer classifies and predicts the final aggregated information.
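The routing-by-agreement procedure in the Capsule layer can be sketched in plain Python. This follows the routing algorithm of Sabour et al. (2017) rather than the authors' implementation; the function names and the nested-list layout `u_hat[i][j]` (prediction vector from lower capsule i to upper capsule j) are assumptions made for illustration.

```python
import math

def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, routings=4):
    """Return the upper-capsule outputs after `routings` routing iterations."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(routings):
        # Coupling coefficients: softmax over upper capsules for each lower capsule.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: scalar product of each prediction with the output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v

# Two lower capsules, two upper capsules, 2-dimensional vectors (toy values).
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.5, 0.5]]]
outputs = dynamic_routing(u_hat, routings=4)
```

The default of four routing iterations matches the `routings = 4` setting reported in the experiment settings.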

K-fold ensemble
To enhance the overall classification performance of the model, we used a K-fold ensemble method. The design of this method comes from K-fold cross-validation: we randomly divide the source data into K parts, train on K-1 of them, and use the remaining part as the validation set; this process is repeated K times. Finally, the K results are averaged to obtain the final output. The purpose of the K-fold ensemble is to train on a different data split in each fold, so that the model extracts different features in each fold, which can further improve the generalization ability of the model.
For the unlabeled English dataset released this year, we considered whether to use the OLID dataset provided by OffensEval 2019 instead. We therefore ran a comparative experiment on the choice of training set, using the same validation set. The datasets used in this section are described in Section 3.1. As shown in Table 1, we evaluated both datasets with the same models. Compared with the OLID dataset, the 100,000 randomly selected unlabeled instances improved systems S1 and S2 by 17.35% and 16.82%, respectively. We therefore used the 100,000 randomly selected unlabeled instances as the training set for this experiment. On this basis, we conducted ablation experiments to verify the performance of the model. Following the change from system S2 to S6 in Table 1, system S6 reached 98.86%, an increase of 2.33% over system S2.
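The K-fold ensemble procedure described in this section can be sketched as below. The helper names, the interleaved fold assignment, and the fixed seed are assumptions; model training is left out, so `fold_predictions` stands for the probabilities that the K trained models would produce on the same test instances.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def average_predictions(fold_predictions):
    """Average per-instance probabilities across the k fold models."""
    k = len(fold_predictions)
    return [sum(column) / k for column in zip(*fold_predictions)]

splits = list(k_fold_splits(10, k=5))
# Hypothetical OFF probabilities from two fold models on two test tweets.
ensemble = average_predictions([[0.2, 0.8], [0.4, 0.6]])
```

Averaging probabilities before thresholding is one common reading of "accumulation averaging"; majority voting over hard labels would be an alternative design.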

Experiment setting
In our model, the pre-trained word embedding we used is FastText 2 , which is provided by Mikolov et al. (2017). It contains 2 million word vectors trained using subword information on Common Crawl with 600B tokens, and its dimension is 300 (crawl-300d-2M.vec). In the encoding layer, we set the number of hidden units to 32. In the Capsule layer, we set num capsule = 10, dim capsule = 16, and routings = 4. A Flatten layer follows the Capsule layer to turn the multidimensional input into a one-dimensional vector, providing the transition from a convolution layer to a fully connected layer. A Dense layer with the ReLU activation function and 16 hidden units follows the Flatten layer. We then added a Dropout layer and a BatchNormalization layer, with the dropout rate set to 0.5. In the output layer, we used the sigmoid activation function for binary classification. The loss function of this model is binary cross-entropy, and the optimizer is Adam. We set the batch size to 64 and trained for 20 epochs. Finally, we used the K-fold method for ensembling, with K set to 5.
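For reference, the settings above can be collected in one place together with the layer output sizes they imply. This is a bookkeeping sketch only: the key names are made up for illustration, and the Bi-GRU size assumes that the 32 hidden units apply per direction and that the two directions are concatenated.

```python
# Hyperparameters as reported in the text; key names are illustrative.
CONFIG = {
    "embedding_dim": 300,
    "gru_units": 32,       # assumed to be per direction
    "num_capsule": 10,
    "dim_capsule": 16,
    "routings": 4,
    "dense_units": 16,
    "dropout": 0.5,
    "batch_size": 64,
    "epochs": 20,
    "k_folds": 5,
}

def layer_output_sizes(cfg):
    """Derive the feature size after each layer from the configuration."""
    return {
        "bi_gru_per_step": 2 * cfg["gru_units"],              # forward + backward
        "flatten": cfg["num_capsule"] * cfg["dim_capsule"],   # capsules laid out flat
        "dense": cfg["dense_units"],
        "output": 1,                                          # single sigmoid unit
    }

sizes = layer_output_sizes(CONFIG)
```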

Result
English subtask A evaluates classification systems by the macro-averaged F1 score, which takes into account both the precision and the recall of the classification model. The F1 score is the harmonic mean of the model's precision and recall; its maximum value is 1 and its minimum value is 0. Macro-averaging first computes the metric for each class separately and then takes the arithmetic mean over all classes.
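The macro-averaged F1 score described above can be computed as in the following sketch. The function names and the toy labels are illustrative, and returning 0 for a class with no true positives is a convention chosen here, not something stated in the task description.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # convention: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels=("OFF", "NOT")):
    """Arithmetic mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)

y_true = ["OFF", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)  # per-class F1: OFF = 2/3, NOT = 4/5
```

Because each class contributes equally to the mean, the minority OFF class weighs as much as the majority NOT class in the final score.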

Conclusion
This year we participated in English subtask A of the multilingual offensive language recognition task, which automatically classifies offensive language in social media. This paper proposed a model for English offensive language recognition. We used a Bi-GRU model for classification and a Capsule network to improve the feature extraction capability of the model. Although the overall performance of our model is not the best, the preliminary results indicate what we should do next. In future research, we will study the impact of unlabeled datasets on the results and consider introducing transfer learning; how to optimize the parameters is also an important issue. We will also consider offensive language recognition in the other languages of the task (Greek, Arabic, Danish, and Turkish).