XD at SemEval-2020 Task 12: Ensemble Approach to Offensive Language Identification in Social Media Using Transformer Encoders

This paper presents six document classification models built on recent transformer encoders, together with a high-performing ensemble model, for the task of offensive language identification in social media. In the individual models, deep transformer layers apply multi-head attention. In the ensemble model, the utterance representations taken from the individual models are concatenated and fed into a linear decoder to make the final decision. The ensemble model outperforms the individual models by up to 8.6% on the development set. On the test set, it achieves a macro-F1 of 90.9%, placing it among the high-performing systems of the 85 participants in sub-task A of this shared task. Our analysis shows that although the ensemble model significantly improves accuracy on the development set, the improvement is not as evident on the test set.


Introduction
With the development of IT, social media has become increasingly popular as a place for people to express their views and exchange ideas publicly. However, some people take advantage of the anonymity of social media platforms to comment rudely and to attack others verbally with offensive language. To keep the online environment healthy for adolescents (Chen et al., 2012) and to filter offensive messages for users (Razavi et al., 2010), it is necessary for technology companies to develop efficient and effective computational methods that identify offensive language automatically.
Transformer-based contextualized embedding approaches such as BERT (Devlin et al., 2019a), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020) and ELECTRA (Clark et al., 2020) have re-established the state of the art for many natural language classification tasks, especially those in the GLUE dataset (Wang et al., 2018). These models were pre-trained on different large corpora: for example, BERT was pre-trained on BOOKCORPUS (Zhu et al., 2015) and English Wikipedia, while RoBERTa was pre-trained on CC-NEWS (Nagel, 2016), OPENWEBTEXT (Gokaslan and Cohen, 2019), and STORIES (Trinh and Le, 2018). As a result, the models learn different language features. This paper presents six transformer-based offensive language identification models that learn different features from the target utterance. To combine these distinct features, we introduce an ensemble strategy that concatenates the representations of the individual models and feeds them into a linear decoder for binary classification (Section 4.2). This strategy substantially improves performance over the baselines on our dev set (Section 4.4).

Data Description
The datasets we use are the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019) and the Semi-Supervised Offensive Language Identification Dataset (SOLID). Given a tweet, the task is to predict whether its content involves offensive language. Table 1 shows examples of offensive and non-offensive tweets in these two datasets. OLID is a collection of 14,100 English tweets annotated as OFF or NOT. It is divided into a training set of 13,240 tweets and a test set of 860 tweets (Zampieri et al., 2019). SOLID is a collection of about 9 million English tweets labeled in a semi-supervised manner. Each tweet is annotated with AVG_CONF and CONF_STD values predicted by several supervised models. The test set provided by the organizers this year has 3,887 tweets. Table 2 shows the statistics of OLID and SOLID.

Data Split
For our experiments, a combination of OLID and SOLID (Section 3) is used. About 1.0% of the SOLID tweets are duplicates, which we remove before splitting the data. For the dataset used to fine-tune the classification models, we set the AVG_CONF threshold (Section 3) to 0.5 in SOLID, so that tweets with AVG_CONF above 0.5 are labelled as OFF. 90% of the TRN of OLID is combined with the whole of SOLID to form the new training set TRN for fine-tuning the default transformer-based models (FT). The remaining 10% of the TRN of OLID and the TST of OLID are used as the development set DEV of FT. All the existing datasets are combined to form the training set TRN for model pre-training (PT). After pre-training, 99.5% of SOLID is randomly selected as the training set TRN and the remaining 0.5% of SOLID as the development set DEV for fine-tuning our pre-trained models into classification models and regression models (PT-C and PT-R). In PT-C, tweets with AVG_CONF above 0.5 are labelled as OFF; in PT-R, the original AVG_CONF value is used. Furthermore, 90% of the TRN of OLID is randomly selected as the new training set TRN, and the remaining 10% is combined with the TST of OLID to form the development set DEV for further fine-tuning of the classification models and regression models (PT-C-C and PT-R-C). The ensemble model is fine-tuned on the same dataset as PT-C-C. Table 3 shows the detailed statistics of the data split in our experiments. The abbreviations are as follows. FT: dataset used for default model fine-tuning; PT: dataset used for default model pre-training; PT-R: dataset used for fine-tuning our pre-trained models into regression models; PT-C: dataset used for fine-tuning our pre-trained models into classification models; PT-R-C: dataset used for fine-tuning regression models into classification models; PT-C-C: dataset used for further fine-tuning classification models; E: dataset used for fine-tuning ensemble models.
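The construction of the FT split described above can be sketched in a few lines of Python. This is a minimal illustration on toy data, not the authors' code: the function names, the toy tweets, and the use of seed 42 for shuffling are all assumptions made for the example. It shows exact-duplicate removal, thresholding AVG_CONF at 0.5, and a 90/10 split of the OLID training set whose held-out part joins the OLID test set as DEV.

```python
import random


def deduplicate(tweets):
    """Drop exact duplicate tweet texts, keeping the first occurrence."""
    seen = set()
    unique = []
    for text, score in tweets:
        if text not in seen:
            seen.add(text)
            unique.append((text, score))
    return unique


def label_from_confidence(avg_conf, threshold=0.5):
    """Label a tweet OFF only when AVG_CONF is strictly above the threshold."""
    return "OFF" if avg_conf > threshold else "NOT"


def split_train_dev(examples, dev_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and hold out dev_fraction of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seed choice is an assumption
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-ins for SOLID (text, AVG_CONF) and OLID (text, label).
solid = [("tweet a", 0.81), ("tweet b", 0.12), ("tweet a", 0.81), ("tweet c", 0.50)]
olid_trn = [(f"olid tweet {i}", "OFF" if i % 3 == 0 else "NOT") for i in range(20)]
olid_tst = [("olid test 0", "NOT"), ("olid test 1", "OFF")]

solid_labelled = [(text, label_from_confidence(score)) for text, score in deduplicate(solid)]
olid_keep, olid_held_out = split_train_dev(olid_trn, dev_fraction=0.1)

ft_trn = olid_keep + solid_labelled  # 90% of OLID TRN plus all of SOLID
ft_dev = olid_held_out + olid_tst    # remaining 10% of OLID TRN plus OLID TST
```

Note that a tweet with AVG_CONF exactly equal to 0.5 is labelled NOT here, following the "above 0.5" wording in the text.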

Models
In general, the default transformer-based models are fine-tuned as baseline models. The sequence of input embeddings generated by the transformer encoder is fed into a linear decoder to obtain the output vector used for binary classification. We then pre-train these default models and choose the models with the lowest perplexity. Next, we fine-tune the pre-trained models into regression models and classification models on the corresponding datasets. The regression models and classification models are then fine-tuned again into classification models. Finally, the sentence representations of the individual models are concatenated and fed into a linear decoder to generate the output vector that decides whether or not the tweet is offensive. In our experiments, two types of transformer-based models are used as the default models: the BERT-Base model (Devlin et al., 2019b) and the RoBERTa-Base model (Liu et al., 2020). For default model fine-tuning, the default BERT-Base and RoBERTa-Base are fine-tuned on FT (Section 4.1) as baseline models. For pre-training, BERT-Base and RoBERTa-Base are pre-trained on PT (Section 4.1). The two pre-trained models with the lowest perplexity are then fine-tuned into regression models and classification models on PT-R and PT-C, respectively. Next, these fine-tuned models are further fine-tuned into classification models on PT-R-C and PT-C-C. Finally, the sentence representations of the six individual models are concatenated to form the ensemble model, which is fine-tuned on E. Figure 1 shows an overview of the six individual models and the ensemble model.
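The ensemble step described above (concatenating the sentence representations of the six individual models and passing them through a linear decoder) can be illustrated with a small dependency-free sketch. Everything here is hypothetical: the class and function names, the random stand-in vectors, the toy hidden size of 4 (BERT-Base and RoBERTa-Base use 768), and the untrained random weights. In the paper the decoder is fine-tuned on E; this sketch only shows the shape of the computation.

```python
import random


def concatenate(representations):
    """Join the per-model sentence vectors into one long feature vector."""
    joined = []
    for vector in representations:
        joined.extend(vector)
    return joined


class LinearDecoder:
    """A single linear layer mapping a feature vector to class logits."""

    def __init__(self, in_features, num_classes=2, seed=42):
        rng = random.Random(seed)
        # Untrained random weights, for illustration only.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def __call__(self, features):
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def predict(logits, labels=("NOT", "OFF")):
    """Return the label whose logit is largest."""
    best_index = max(range(len(logits)), key=logits.__getitem__)
    return labels[best_index]


HIDDEN = 4      # toy size; the real encoders produce 768-dimensional vectors
NUM_MODELS = 6  # six individual models feed the ensemble

rng = random.Random(0)
# Stand-ins for the sentence representations produced by each model.
representations = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(NUM_MODELS)]

ensemble_input = concatenate(representations)
decoder = LinearDecoder(in_features=len(ensemble_input))
logits = decoder(ensemble_input)
label = predict(logits)
```

With 768-dimensional representations, the same concatenation would give the decoder a 4,608-dimensional input.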

Experimental Setup
In our experiments, data preprocessing does not contribute significantly to the final prediction results on such a large dataset, so we skip it. Based on an analysis of sentence length in the dataset, we set the max_length of the models to 128. After an extensive hyper-parameter search, we set learning_rate to 2e-5, seed_value to 42, and epochs to 10 for our six individual models and the ensemble model. We then experiment further with the ensemble model and find that the best result is obtained by changing learning_rate to 1e-5 and dropout to 0.5. Table 4 shows the results achieved by our individual models and ensemble model. The selected pre-trained BERT-Base and RoBERTa-Base models have the lowest perplexities, 21.3 and 47.5, respectively. Our fine-tuned pre-trained classification-classification BERT and RoBERTa models outperform their baseline counterparts by about 1.7% and 1.1%, respectively. In addition, our fine-tuned pre-trained regression-classification BERT and RoBERTa models show 2.1% and 1.8% improvements over their baselines. The ensemble model with a learning_rate of 1e-5 and a dropout of 0.5 (E_2) achieves a significant improvement on the development set, outperforming the BERT baseline and the RoBERTa baseline by 8.5% and 8.6%, respectively. We therefore use this ensemble model as our final model and submit its predictions to the shared task's CodaLab page. We achieve a macro-F1 score of 90.901% on the test set and rank 36th among 85 participants in sub-task A. After the release of the gold labels, we also calculate our other models' performance on the test set (Table 4) and make a detailed comparison and analysis among them (Section 4.5.1).
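The text reports perplexities for the selected pre-trained models without stating how they were computed. As a hedged illustration, the sketch below assumes the common definition of perplexity as the exponential of the mean per-token negative log-likelihood (natural log), and assumes that selection means picking the candidate with the smallest value. The function names, checkpoint names, and loss values are invented for the example.

```python
import math


def perplexity(token_losses):
    """Exponentiate the mean per-token negative log-likelihood."""
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    return math.exp(sum(token_losses) / len(token_losses))


def select_lowest_perplexity(checkpoints):
    """Return the name of the checkpoint whose perplexity is lowest."""
    return min(checkpoints, key=lambda name: perplexity(checkpoints[name]))


# Hypothetical held-out token losses for three candidate checkpoints.
checkpoints = {
    "epoch_1": [3.4, 3.6, 3.5],
    "epoch_2": [3.0, 3.1, 3.2],
    "epoch_3": [3.1, 3.3, 3.2],
}

best = select_lowest_perplexity(checkpoints)
```

Under this definition, a perplexity of 21.3 corresponds to a mean loss of ln(21.3), roughly 3.06 nats per token.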

Ablation Analysis
When we fine-tune our pre-trained models B-PT-C, B-PT-R, R-PT-C, and R-PT-R on only 10% of PT-R and PT-C (Section 4.4), the resulting models B-PT-C-C, B-PT-R-C, R-PT-C-C, and R-PT-R-C reach accuracies of 82.822%, 83.326%, 83.280%, and 83.646%, respectively, which are lower than the results obtained with the full data (Table 4). This indicates that deep learning models trained on larger datasets perform better. For the ensemble model, when we decrease the learning_rate from 2e-5 (E) to 1e-5 (E_LL), the performance improves from 88.548% to 90.701%, which shows that the ensemble model is sensitive to changes in the learning rate. Changing the dropout from the default 0.1 (E_LL) to 0.5 (E_HD) increases the performance to 90.884%, which indicates the influence of the dropout rate. After comparing the predicted labels of our unsubmitted models with the released gold labels (Table 4), we see that the model with the highest accuracy on the development set does not perform best on the test set, which may be caused by overfitting. The pure fine-tuned BERT-Base model (B_FT) achieves the same accuracy as the other two ensemble models. In addition, because of the data imbalance, higher accuracy does not guarantee a higher F1 score.
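The last point, that accuracy and F1 can diverge on imbalanced data, can be checked with a small worked example. The sketch below implements accuracy and macro-F1 (the unweighted mean of per-class F1) in plain Python. The gold labels and both prediction lists are toy data invented for the illustration, not results from the paper.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def f1_for_class(gold, pred, cls):
    """F1 score treating cls as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=("NOT", "OFF")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy imbalanced gold labels: 8 NOT, 2 OFF.
gold = ["NOT"] * 8 + ["OFF"] * 2

# System A always predicts the majority class.
pred_a = ["NOT"] * 10
# System B finds both OFF tweets but flags two NOT tweets as OFF.
pred_b = ["NOT"] * 6 + ["OFF"] * 2 + ["OFF"] * 2

acc_a, acc_b = accuracy(gold, pred_a), accuracy(gold, pred_b)
f1_a, f1_b = macro_f1(gold, pred_a), macro_f1(gold, pred_b)
```

Both toy systems have an accuracy of 0.8, yet system A has a macro-F1 of 4/9 (about 0.444) while system B reaches 16/21 (about 0.762), because macro-F1 weights the minority class OFF as heavily as the majority class.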

Error Analysis
The confusion matrix in Figure 2 further displays the error pattern of our classifier on the test set. Only three instances labeled OFF are misclassified as NOT, while more instances labeled NOT are classified as OFF. Table 5 shows these three misclassified offensive examples along with other misclassified non-offensive tweets. One explanation for these results may be that the imbalance of the dataset leads the classifier to prefer the majority class. It is also possible that our classifier does not capture some subtle nuances of meaning and context, and our system still needs improvement in handling these details.

Conclusion
This paper explores the performance of six individual transformer-based models and their ensemble model on the task of offensive language identification in social media. The default BERT-Base and RoBERTa-Base models are fine-tuned individually to establish strong baselines for the ensemble model. In the ensemble model, the sentence representations of the six individual models are concatenated and fed into a linear decoder to make the binary decision. Our ensemble model with higher dropout improves accuracy on the dev set by up to 8.6% over the baseline models. However, on the test set it performs worse than the baseline model B-FT and the original ensemble model E, which reach an accuracy of 92.153%. This may be caused by model overfitting and data imbalance, which are problems we need to take into consideration in future experiments.