UJNLP at SemEval-2020 Task 12: Detecting Offensive Language Using Bidirectional Transformers

In this paper, we describe several pre-trained models built for our participation in SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media. In offensive language identification, pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved good results. We preprocess the dataset according to the language habits of social network users. To address the data imbalance in OffensEval, we screened the newly provided machine-annotated samples to construct a new dataset, which we use to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa). For English subtask B, we add Auxiliary Sentences (AS) to transform the single-sentence classification task into a sentence-pair relationship recognition task. Our team, UJNLP, ranked 16th of 85 in English subtask A (offensive language identification).


Introduction
With the explosive growth of data generated by online social network users, malicious content mixed into user posts has made detecting hate speech and offensive language very challenging. Unlike traditional media, social media allows users to post information at will. The sheer volume of data makes manual review impractical, which has led to many methods that use machine learning for automatic classification. To this end, SemEval-2020 released OffensEval2, which, compared to OffensEval (Zampieri et al., 2019b), increases the size of the dataset and adds multiple languages: Arabic (Mubarak et al., 2020), Danish (Sigurbergsson and Derczynski, 2020), English, Greek (Pitenis et al., 2020), and Turkish (Çöltekin, 2020). The English task is divided into three subtasks: offensive language identification, automatic categorization of offense types, and offense target identification.
Our method for the English task is based on BERT and RoBERTa. The organizers provided a large number of machine-labeled samples together with a confidence score for each sample. Because deep neural networks rely on large-scale datasets, the results obtained with the OLID dataset alone are not ideal. Due to equipment limitations and the error in the machine-labeled samples, we did not use all of the data for training. Starting from the OLID dataset (Zampieri et al., 2019a), we add machine-labeled data whose confidence exceeds a threshold so that the numbers of positive and negative samples are equal, avoiding the problems caused by dataset imbalance. For subtask B, we add Auxiliary Sentences (AS) to transform the single-sentence classification task into a sentence-pair relationship recognition problem. We compared BERT with RoBERTa, which removes the Next Sentence Prediction (NSP) task, and found that BERT is better at modeling the relationship between sentences.
We compare the BERT and RoBERTa models and conclude that RoBERTa achieves better results on large-scale datasets. Due to time constraints, we only submitted results for English subtask A, where our team, UJNLP, ranked 16th out of 85. After the competition, we tested our method on subtask B and subtask C.

Related work
In recent years, many researchers and research institutions have studied insulting language, hate speech and other offensive language. The English task of OffensEval2 consists of three subtasks. Subtask A aims to classify a post as not offensive (NOT) or offensive (OFF). Subtask B aims to classify offensive language as targeting specific entities (TIN) or not targeting specific entities (UNT). Subtask C aims to determine whether the target of the offense is an individual (IND), a group (GRP) or other (OTH). Beyond OffensEval, researchers have also studied Chinese (Su et al., 2017), Slovenian (Fišer et al., 2017) and German (Wiegand et al., 2018).
One of the benefits of transfer learning is that it can learn effectively from limited labeled data. The Bidirectional Encoder Representations from Transformers (BERT) model proposed by the Google AI Language team (Devlin et al., 2018) is pre-trained on large corpora from different sources. NULI compared a linear model, an LSTM and BERT, and chose BERT as the best performer (Liu et al., 2019). They reported that BERT performed best in subtask A and achieved first place in subtask A of SemEval-2019 Task 6. In subtasks B and C, the label distribution is less balanced and there is less data, so the results are not as good as in subtask A. In contrast to other models, BERT uses a bidirectional representation to exploit both left and right context, and deepens its understanding of a sentence by capturing long-term dependencies between its parts (Wu et al., 2019). Kumar et al. (2019) argue that word-level preprocessing is very important, because only then do the words form sentences that conform to normal grammatical structure. Aglionby et al. (2019) check each word before training the model: an unknown word is first processed with a word segmentation tool and, if it is still not in the dictionary, passed to error correction. They also adopted an attention-based deep learning model that combines a BiLSTM with emoji attention. This processing has been shown to be very effective.

System overview
3.1 Dataset
The OffensEval2020 dataset available to participants contains 13,240 tweets from the OLID dataset and about 10 million machine-labeled tweets for subtask A. Considering the limits of our hardware and the error in machine labeling, we selected the machine-labeled samples with confidence greater than 0.88 and added them to the OLID dataset to form a balanced dataset with a similar number of samples per class. The label counts in the adjusted dataset are shown in Table 1, Table 2 and Table 3. In each table, the second row is the OLID data, the third row is the data we added, and the fourth row is the combination of the two.
BERT. BERT uses the Transformer (Vaswani et al., 2017) and a bidirectional representation to capture long-distance dependencies between the parts of a sentence in context, giving it a deeper understanding of the sentence. BERT is pre-trained with a Next Sentence Prediction (NSP) task, which makes it well suited to recognizing relationships between sentences.
RoBERTa. RoBERTa is trained on more data and with larger batches. To capture the relationship between sentences, BERT uses the Next Sentence Prediction (NSP) task as a pre-training objective, but the RoBERTa authors consider the NSP task too simple and remove it. In addition, RoBERTa-style pre-trained models are available for other languages (Conneau et al., 2019; Martin et al., 2019), which will help with multilingual tasks.
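The dataset construction described at the start of this section can be illustrated with a minimal sketch. The 0.88 threshold comes from the text; the tuple layout of the inputs, the `target_per_class` parameter, the shuffling seed, and the function name are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter

CONFIDENCE_THRESHOLD = 0.88  # value stated in the text


def build_balanced_dataset(olid, machine_labeled, target_per_class,
                           threshold=CONFIDENCE_THRESHOLD, seed=0):
    """Extend OLID with confident machine-labeled samples (illustrative sketch).

    olid:            list of (text, label) pairs (assumed layout)
    machine_labeled: list of (text, label, confidence) triples (assumed layout)
    target_per_class: hypothetical cap on the number of samples per label
    """
    rng = random.Random(seed)
    combined = list(olid)
    counts = Counter(label for _, label in olid)

    # Keep only machine-labeled samples whose confidence exceeds the threshold.
    confident = [(text, label)
                 for text, label, conf in machine_labeled
                 if conf > threshold]
    rng.shuffle(confident)

    # Add confident samples only to classes that are still below the target.
    for text, label in confident:
        if counts[label] < target_per_class:
            combined.append((text, label))
            counts[label] += 1
    return combined, counts


# Toy data: OLID is skewed towards NOT, so OFF is topped up.
olid = [("a", "NOT"), ("b", "NOT"), ("c", "NOT"), ("d", "OFF")]
machine = [("e", "OFF", 0.95), ("f", "OFF", 0.91), ("g", "OFF", 0.50),
           ("h", "NOT", 0.99), ("i", "OFF", 0.89)]
dataset, counts = build_balanced_dataset(olid, machine, target_per_class=3)
```

In this toy run the low-confidence sample is dropped and only the minority class receives additional samples, so both labels end up with the same count.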

Methodology
For subtask A, because we have a large dataset and the subtask relies mainly on shallow text features, we fine-tune RoBERTa on the dataset and use it for our final submission. For subtask B, shallow text features alone are not enough to achieve the desired effect. Because BERT is better at modeling the relationship between sentences, we follow Sun et al. (2019), who convert a single-sentence classification task into a sentence-pair relationship judgment task by adding an auxiliary sentence. We add the Auxiliary Sentence (AS) "Do posts contain the target audience of profanity?" to each sample of subtask B and fine-tune BERT on the result, which improves performance; RoBERTa does not show this improvement. For subtask C, we only fine-tuned RoBERTa; more experiments will be conducted in the future.
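As an illustration of the auxiliary-sentence scheme for subtask B, the sketch below pairs each tweet with the auxiliary sentence quoted above and renders the pair in the usual BERT two-segment layout. The segment order (tweet first, auxiliary sentence second) and the helper names are assumptions; the paper does not state them.

```python
# Auxiliary sentence quoted from the text.
AUXILIARY_SENTENCE = "Do posts contain the target audience of profanity?"


def to_sentence_pair(tweet, auxiliary=AUXILIARY_SENTENCE):
    """Turn a single-sentence sample into a (text_a, text_b) pair.

    Placing the tweet first is an assumption made for this sketch.
    """
    return (tweet, auxiliary)


def render_bert_input(pair):
    """Show the two-segment layout BERT sees: [CLS] A [SEP] B [SEP]."""
    text_a, text_b = pair
    return f"[CLS] {text_a} [SEP] {text_b} [SEP]"


pair = to_sentence_pair("@USER you are awful")
rendered = render_bert_input(pair)
```

In practice a BERT tokenizer typically inserts the special tokens itself when it is given two text segments, so `render_bert_input` is only meant to make the resulting layout visible.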

Experimental setup
4.1 Data Pre-processing
The organizers have already pre-processed the samples in the dataset: user mentions and links are replaced with the standard tags @USER and URL. Our own pre-processing mainly handles emojis and lemmatization.
Emoji substitution. To preserve the semantic and emotional information contained in emojis, we used an online emoji project on GitHub. This project maps emojis to phrases, so that we can handle this content more conveniently.
Lemmatization. Text published on online social media often contains many incorrect grammatical forms. To standardize the input to the embedding layer, we use the WordNetLemmatizer module provided by NLTK to reduce words to their base forms.
Misc. We convert all text to lower case. Runs of consecutive "@USER" tags are limited to a maximum of three to reduce redundancy.
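The pre-processing steps above can be sketched roughly as follows. The emoji table is a hypothetical stand-in for the GitHub emoji project, which the text does not name, and the order of the steps is an assumption. Lemmatization is left out so that the sketch runs without downloading WordNet data; NLTK's `WordNetLemmatizer` would be applied per token as an additional step.

```python
# Hypothetical emoji-to-phrase table (assumption); the paper relies on an
# unnamed GitHub project for this mapping.
EMOJI_TO_PHRASE = {
    "\U0001F602": "face with tears of joy",
    "\U0001F621": "pouting face",
}


def substitute_emojis(text, table=EMOJI_TO_PHRASE):
    """Replace each known emoji with a descriptive phrase."""
    for emoji, phrase in table.items():
        text = text.replace(emoji, f" {phrase} ")
    return text


def limit_user_mentions(text, max_run=3):
    """Keep at most `max_run` consecutive @USER tokens."""
    kept, run = [], 0
    for token in text.split():
        if token == "@USER":
            run += 1
            if run > max_run:
                continue  # drop redundant mentions beyond the limit
        else:
            run = 0
        kept.append(token)
    return " ".join(kept)


def preprocess(text):
    """Emoji substitution, mention limiting, then lower-casing (assumed order)."""
    text = substitute_emojis(text)
    text = limit_user_mentions(text)
    return text.lower()


cleaned = preprocess("@USER @USER @USER @USER @USER You are FUNNY \U0001F602")
```

Mentions are collapsed before lower-casing so that the token comparison against "@USER" stays exact; splitting on whitespace also normalizes the extra spaces introduced by the emoji substitution.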

Experiments
For subtask A and subtask C, we use RoBERTa-base. Its Transformer has 12 layers, a hidden size of 768 and 12 self-attention heads. A softmax classification head is added on top of the pre-trained language model. The dataset is split into 90% for training and 10% for development. We use the pre-processed training set as input to fine-tune the classification model. The hyper-parameters used for fine-tuning are as follows: the maximum sentence length is 64, the batch size is 128, the learning rate is 1e-5, the weight decay is 1e-4, and the number of epochs is 10. For subtask B, we fine-tune BERT-base-cased on the dataset with the added Auxiliary Sentence (AS); the other hyper-parameters are the same.
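For reference, the settings listed above can be collected in a small configuration object together with a 90/10 split. The numeric values are taken from the text; the class name, field names, the shuffling, and the seed are assumptions made for this sketch, since the paper does not say how the split was drawn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyper-parameters reported in the text (field names are hypothetical)."""
    max_seq_length: int = 64
    batch_size: int = 128
    learning_rate: float = 1e-5
    weight_decay: float = 1e-4
    epochs: int = 10
    dev_fraction: float = 0.10  # 90% training, 10% development


def train_dev_split(samples, dev_fraction, seed=0):
    """Shuffle and split samples into training and development sets.

    Random shuffling with a fixed seed is an assumption of this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


config = FineTuneConfig()
train, dev = train_dev_split(range(100), config.dev_fraction)
```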

Results
Macro-F1 is the official metric for all subtasks in this task. Our RoBERTa model achieves an F1 of 0.9128 on subtask A (@CodaLab). Outside of the competition, BERT (AS) scores 0.6376 on subtask B and RoBERTa scores 0.6158 on subtask C. Table 5, Table 6 and Table 7 show the accuracy and average macro-F1 for each of the three subtasks. In the table headers, P denotes precision and R denotes recall.
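To make the metric concrete, the following is a minimal pure-Python sketch of macro-F1: per-class precision (P), recall (R) and F1 are computed first, and the F1 values are then averaged with equal weight per class. The function name and the toy labels are illustrative only and are not part of the task's official scorer.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: NOT has P = 2/3, R = 1 (F1 = 0.8); OFF has P = 1, R = 0.5 (F1 = 2/3).
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)  # (0.8 + 2/3) / 2 = 11/15
```

Because every class contributes equally, a model that neglects a minority class is penalized more heavily than it would be under accuracy, which is why the metric suits imbalanced subtasks.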

Analysis
For subtask B, we fine-tune on the data with added auxiliary sentences. Because BERT is pre-trained with the NSP task, adding the auxiliary sentences yields better results. When RoBERTa is fine-tuned on the same data, performance drops, which we attribute to additional error introduced by the auxiliary sentence. Because we constructed the validation set with reference to the OLID test set, whose sample distribution is inconsistent with that of the OffensEval2 test set, the predicted categories are seriously skewed. The comparison of the results of each model is shown in Table 8.

Conclusion
In this paper, we presented the results of the UJNLP team's participation in SemEval-2020 Task 12. Due to time limitations, we only submitted results for subtask A. Outside of the competition, we tested our scheme on subtask B and subtask C. We noticed that the task in OffensEval2019 has a class imbalance problem, which has a significant impact on system performance. To offset the imbalance, we screened the machine-labeled data and combined it with the OLID dataset to build a balanced dataset. We use RoBERTa as the base model and fine-tune it on this dataset. For subtask B, we add auxiliary sentences to convert the single-sentence classification task into a sentence relationship recognition task, which achieves good results with BERT. Our model ranked 16th in subtask A with a macro-F1 of 0.9128. Our results show that subtask A is handled well, while subtask B and subtask C still have open problems. In the future, we will further apply the proposed model to subtasks B and C.