TECHSSN at SemEval-2020 Task 12: Offensive Language Detection Using BERT Embeddings

This paper describes the work of identifying the presence of offensive language in social media posts and categorizing a post as targeted at a particular person or not. The work developed by team TECHSSN for solving the Multilingual Offensive Language Identification in Social Media task (Task 12) in SemEval-2020 involves the use of deep learning models with BERT embeddings. The dataset is preprocessed and given to a Bidirectional Encoder Representations from Transformers (BERT) model with pretrained weight vectors. The model is retrained and the weights are learned for the offensive language dataset. We have developed a system with the English language dataset. The results are better than those of the model we developed in SemEval-2019 Task 6.


Introduction
The usage of offensive or hate words is increasing these days, largely in online communication. People find it easier to express their opinions and thoughts online rather than in person. The anonymity provided by the online environment encourages many people to express their views aggressively. Information spreads extremely fast in online social media, and people spread the information they receive to others without checking its validity.
The text content posted in messages, websites, social media and blogs is highly unstructured and informal, often misspelt, and uses shorthand notations, emojis and emoticons. Extracting the meaning of such natural text is a complicated task. There is ongoing research in the field of offensive language and hate speech detection, yet a full-fledged model remains an open goal. We participated in the offensive language task conducted in SemEval-2019 by Zampieri et al. (2019a) with various machine learning and deep learning models in Rajalakshmi et al. (2019a). For SemEval-2020, we have developed and tested deep learning models with different word embeddings. We have used GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2015) and BERT (Devlin et al., 2018) pretrained embedding models to identify the presence of offensiveness in tweets. We have participated in subtask A (OFF/NOT), classifying whether a tweet is offensive or non-offensive, and subtask B (TIN/UNT), classifying whether an offensive tweet is targeted at anyone or untargeted.
The rest of the paper is organized as follows. Section 2 surveys the related work in this field. Section 3 describes the methodology used to solve the task. Results are discussed in section 4 and the conclusion in section 5.


Related Work
Related work on offensive language identification is surveyed in the SemEval-2019 Task 6 report of Zampieri et al. (2019b) and in Mandl et al. (2019). Most of the works use machine learning and deep learning techniques. The work by Wu et al. (2019) uses a BERT uncased model with an F1 score of 0.8057 for task A and 0.50 for task B. Pavlopoulos et al. (2019) use the Perspective API together with BERT cased and uncased models to detect offensive language, with an F1 score of 0.7933 for task A and 0.6817 for task B.

System Methodology
We have participated in SemEval-2019 Task 6 to identify and categorize English tweets using machine learning and deep learning models (Rajalakshmi et al., 2019a), where it was found that a 1D-CNN with GloVe embeddings and a 2D-CNN with Word2Vec embeddings performed better than the other machine learning and deep learning algorithms. For SemEval-2020 Task 12, in addition to these algorithms, we also used a deep learning model with BERT embeddings. The proposed system comprises the following modules:
• Dataset preparation
• Preprocessing
• Training the model using deep learning techniques
• Evaluating the prediction performance of the model

Dataset Preparation
The Semi-Supervised Offensive Language Identification Dataset (SOLID) used for developing the system comprises 9,089,140 instances in the training dataset for subtask A and 3,887 instances in the test dataset. It has 188,974 offensive instances in the training dataset for subtask B and 1,422 instances in the test dataset. The breakdown of the dataset and its classes is shown in Table 1. Since the available infrastructure in our lab does not support working with the entire dataset for subtask A, we have used 1,000,000 instances to train the model. Each entry in the dataset consists of the features Id, Tweet, Avg conf and Conf std. Avg conf is the average of the confidences predicted by several supervised models that a specific instance belongs to the positive class for that subtask, and Conf std is the standard deviation of those confidences. The class labels are OFF (offensive) and NOT (not offensive) for subtask A, and TIN (targeted insult and threat) and UNT (untargeted) for subtask B. We have taken the threshold for Avg conf to be 0.5: values greater than the threshold are taken as positive examples and the rest as negative examples. The dataset is prepared in the format Id, Tweet, Label suitable for training the model.
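The label-derivation step above can be sketched as follows (a minimal illustration with made-up rows; the field names mirror the Id, Tweet, Avg conf, Conf std layout described in the text, and the 0.5 threshold is the one we used):

```python
# Sketch of deriving hard labels from SOLID's averaged confidences.
# Rows here are hypothetical examples, not actual dataset entries.

def derive_label(avg_conf, threshold=0.5, pos="OFF", neg="NOT"):
    """Map a model-averaged confidence to a hard class label."""
    return pos if avg_conf > threshold else neg

rows = [
    # (Id, Tweet, Avg conf, Conf std)
    ("1001", "some tweet text", 0.87, 0.05),
    ("1002", "another tweet",   0.12, 0.09),
]

# Prepared in the (Id, Tweet, Label) format used to train the model.
prepared = [(tid, text, derive_label(conf)) for tid, text, conf, _std in rows]
```

For subtask B, the same mapping applies with the TIN/UNT labels substituted for OFF/NOT.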

Preprocessing
Unstructured tweet data contains many irregularities which affect the accuracy of the model; therefore, it is important to preprocess the data before using it to build the model. Data preprocessing is an important step for increasing the performance of the model. The data is preprocessed by removing the irregularities and by smoothing and normalizing the dataset. We have used the NLTK (Bird et al., 2009) and spaCy (Honnibal and Montani, 2017) toolkits to preprocess the dataset. The preprocessing steps that we developed in Rajalakshmi et al. (2019a) are listed below.
Step e. is omitted for subtask B, since stopwords are significant only for target identification. We preprocessed the data with the steps outlined above and built the model using the resulting plain text. Furthermore, the importance of accented characters, special characters and fully uppercase words, and how they affect the performance of the system, can be analyzed in future work.
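A minimal sketch of this kind of tweet cleaning is shown below. The exact step list follows Rajalakshmi et al. (2019a); the operations and the tiny stopword set here are illustrative stand-ins, with stopword removal toggled off for subtask B as described above:

```python
import re

# Illustrative tweet cleaning (not the paper's exact step list).
TOY_STOPWORDS = frozenset({"a", "an", "the", "is", "are"})  # assumed subset

def clean_tweet(text, remove_stopwords=True):
    text = text.lower()                            # normalize case
    text = re.sub(r"https?://\S+", "", text)       # drop URLs
    text = re.sub(r"@\w+", "@user", text)          # anonymize mentions
    text = re.sub(r"[^a-z0-9@' ]+", " ", text)     # strip punctuation/emoji
    tokens = text.split()
    if remove_stopwords:                           # skipped for subtask B
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)
```

In practice the tokenization and stopword lists would come from the NLTK and spaCy toolkits mentioned above rather than the hand-rolled set here.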

Model Building
Various deep learning techniques with different word embeddings are applied on the SOLID dataset and their performance is analyzed. Our work in SemEval-2019 Task 6 showed that the 2D-CNN model with Word2Vec embeddings and the 1D-CNN model with GloVe embeddings performed better than all the other machine learning and deep learning algorithms with different word embeddings. For the present task, we have used LSTM and BERT models in addition to these algorithms.

2D-CNN with Word2Vec Learned Embeddings
We have used a 2D-Convolutional Neural Network (CNN) model with Google's Word2Vec pretrained weights, as in Rajalakshmi et al. (2019b). The model is then retrained to relearn the weights for the OffensEval-2020 SOLID dataset. The structure of the model comprises the following layers.
1. Input layer
2. Embedding layer
3. Convolutional layers with kernel sizes 2, 3 and 4
4. Pooling layers for the CNN layers
5. Fully connected dense layer
6. Output layer

The embedding layer is used to relearn the weights of the embedding matrix. Kernel filters are used to process bigrams, trigrams and four-grams. A max pooling layer scales down the output vectors to dense feature vectors, which are concatenated and flattened in the fully connected layer. The output layer consists of 2 units for OFF/NOT or TIN/UNT. The word n-gram branches are computed in parallel and concatenated to extract the possible information from the vectors, which enables the classifier to understand the relationships between words. The parameters of the model are set as follows: the sequence length is 43, the learning rate is 0.001 and the dropout is 0.5. The softmax activation function is used in the output layer and the ReLU activation function in all other layers.
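The parallel n-gram branches can be sketched at the shape level as below. This is a NumPy toy with random weights, not the trained model; the filter count of 100 is assumed for illustration, while the sequence length of 43 follows the text:

```python
import numpy as np

# Shape-level sketch of parallel n-gram CNN branches (toy weights).
rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters = 43, 300, 100     # 300-dim Word2Vec assumed
x = rng.standard_normal((seq_len, emb_dim))    # one embedded tweet

def ngram_branch(x, kernel_size):
    """Convolve over kernel_size-word windows, ReLU, then max-pool over time."""
    w = rng.standard_normal((n_filters, kernel_size * emb_dim))
    windows = np.stack([x[i:i + kernel_size].ravel()
                        for i in range(seq_len - kernel_size + 1)])
    feats = np.maximum(windows @ w.T, 0.0)     # ReLU activations
    return feats.max(axis=0)                   # max over positions

# Bigram, trigram and four-gram branches computed and then concatenated,
# yielding the dense feature vector fed to the fully connected layer.
merged = np.concatenate([ngram_branch(x, k) for k in (2, 3, 4)])
```

Each kernel size spans the full embedding dimension, which is what makes a "2D" convolution over a sentence matrix equivalent to sliding over word n-grams.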

1D-CNN with GloVe
A conventional CNN with one-dimensional convolutional layers is used with GloVe embeddings of 1 million word vectors of 200 dimensions trained on Twitter data. The embedding layers are used to extract the skip-grams. Convolutional 1D layers use kernel filters of sizes 2, 3 and 4. The dropout value is set to 0.2 and 100 filters are used. Max pooling 1D layers are applied to the bigram, trigram and four-gram branches, which are then merged for further processing. The softmax function is used in the output layer and the ReLU function in all other layers. This model has fewer trainable parameters and takes less time to train.

BiLSTM
The Recurrent Neural Network (RNN) is designed especially to work with sequential data. Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), an extension of the RNN, are connected in a special way to avoid the vanishing and exploding gradient problems. Bidirectional LSTMs can capture information about the past and future states simultaneously. We have used 2 LSTM layers, one per direction, with 150 units each; the inputs are trained with a batch size of 128 and a dropout value of 0.2. The sigmoid function is used as the activation function in the output layer and the Adam algorithm is used for optimization.
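The bidirectional idea can be illustrated with the following toy pass, where a plain tanh RNN cell stands in for the LSTM cell (gates, the 150-unit width and dropout are omitted for brevity; dimensions are made up):

```python
import numpy as np

# Toy bidirectional pass: one recurrent cell read in each direction,
# final states concatenated. A simple RNN cell stands in for an LSTM.
rng = np.random.default_rng(1)
seq_len, in_dim, hidden = 5, 8, 4
x = rng.standard_normal((seq_len, in_dim))     # one embedded sequence
w_in = rng.standard_normal((in_dim, hidden))
w_h = rng.standard_normal((hidden, hidden))

def rnn_pass(seq):
    h = np.zeros(hidden)
    for step in seq:
        h = np.tanh(step @ w_in + h @ w_h)     # recurrent update
    return h

forward = rnn_pass(x)          # reads past -> future
backward = rnn_pass(x[::-1])   # reads future -> past
state = np.concatenate([forward, backward])    # both contexts available
```

The concatenated state is what gives the downstream classifier simultaneous access to left and right context for each tweet.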

BERT
Bidirectional Encoder Representations from Transformers (BERT) is a deep bidirectional network built using transformers, pre-trained to detect a masked word in a given context sentence, as described by Devlin et al. (2018). It can also be used for text classification and semantic relation extraction. We have used the publicly available BERT-Base Multilingual Cased pre-trained model for 104 languages, including English, Tamil, Telugu, Hindi, Spanish, Arabic, Turkish, Urdu, Danish, Chinese, French and Greek. This model has 12 transformer layers, a hidden size of 768 and 12 attention heads, with 110M parameters. It is therefore well suited for any of the languages given in Task 12. GloVe and Word2Vec embeddings are context-free, while BERT embeddings are contextual. A context-free representation gives a single embedding for each word irrespective of its context: the word "bank" in "bank deposit" and "river bank" has the same representation. In a contextual representation, the word "bank" has different representations based on its nearby words, and the context is considered in both the forward and backward directions. Since BERT uses contextual knowledge in decision making, it provides better interpretation than the context-free GloVe and Word2Vec embeddings, as shown in the results.
We have used the CoLA (Corpus of Linguistic Acceptability) data processor given in Warstadt et al. (2018), which is mainly used for single-sentence classification tasks. The sequence length is set to 128 and the batch size to 32. The Adam optimizer is used, and the learning rate is set to 0.2 to minimize the training time, since the dataset is huge. The tweet sequence is tokenized and converted into features. The model is initialized with pretrained weight vectors and retrained to learn the features of the input dataset.
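The tokenize-and-convert step can be sketched as follows. This toy uses a whitespace tokenizer and a made-up vocabulary; the real pipeline uses BERT's WordPiece tokenizer, but the padding, attention-mask and special-token ([CLS]/[SEP]) layout is the standard one:

```python
# Toy conversion of a tweet into fixed-length BERT-style input features.
# Vocabulary and tokenizer are illustrative; BERT uses WordPiece.
MAX_SEQ_LEN = 128
vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102}

def to_features(tweet):
    # Truncate to leave room for the two special tokens.
    tokens = ["[CLS]"] + tweet.lower().split()[:MAX_SEQ_LEN - 2] + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    mask = [1] * len(ids)                    # 1 = real token, 0 = padding
    pad = MAX_SEQ_LEN - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad

ids, mask = to_features("this is a tweet")
```

These id and mask vectors are the per-example features fed to the pretrained model during retraining.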

Results and Discussion
Various deep learning models with different word embeddings are used to detect the presence of offensiveness in a given tweet and to identify whether the tweet is targeted or not. Table 2 shows the performance of the models used to classify a given tweet into the offensive or non-offensive category in subtask A. The 1D-CNN model is trained with GloVe pretrained embeddings, and the 2D-CNN and BiLSTM models with Word2Vec embeddings. The deep network model with BERT embeddings achieves a better F1 score than the other models. Table 3 shows the results for classifying the tweets into targeted and untargeted. A targeted tweet refers to a particular person, organization or group of people. The results show that the BERT model performs better than the other models. Among the SemEval-2020 teams that participated in task 12, we were ranked 72 in subtask A and 36 in subtask B.

Conclusion and Future work
There is an increase in the usage of profane words in online communication due to the ease with which speakers can remain anonymous. People comment on a particular person, organization or group of people in an aggressive manner in social media, and due to the fast transmission of online communication this content spreads rapidly. This has led to the need to detect offensive tweets and to stop them from spreading further. SemEval-2020 Task 12 involves three subtasks, of which we have participated in subtasks A and B. Deep learning models with different word embeddings were used to perform the tasks. The results show that the deep network model with BERT embeddings performs better than the GloVe and Word2Vec embedding models. Since the BERT model is based on a multilingual cased representation, it can also be applied to other languages such as Tamil, Hindi, Telugu, Kannada, Arabic, Danish, Turkish and Greek. We would like to investigate applying this model to these languages in future work.