Nova-Wang at SemEval-2020 Task 12: OffensEmblert: An Ensemble of Offensive Language Classifiers

This paper presents our contribution to the Offensive Language Classification Task (English SubTask A) of SemEval-2020. We propose several BERT models trained on different offensive language classification and profanity datasets, and combine their output predictions in an ensemble model. We experimented with different ensemble approaches, such as SVMs, gradient boosting, AdaBoost and logistic regression. We further propose an under-sampling approach for the SOLID dataset that removes its most uncertain partitions, increasing recall. Our best model, an average ensemble of four different BERT models, achieved 11th place out of 82 participants with a macro-F1 score of 0.91344 in the English SubTask A.


Introduction
As people turn more and more to the online world for their entertainment, social and communication needs, the anonymity offered by these platforms also draws out diverse and sometimes divisive opinions that can lead to abuse, bullying and mental stress. To ensure the civility of online discussions while allowing opposing ideas to be expressed, we need an effective way to detect offensive language in social media.
This paper presents the model submitted to the English stream of SubTask A, focusing on offensive language classification: a given tweet is classified as offensive (OFF) if it contains any form of profanity or targeted offense, either veiled or direct, and as non-offensive (NON) otherwise.
The official OffensEval 2020 English training dataset was labelled using a semi-supervised method with several different models. The dataset provides a score (µ) corresponding to the average prediction confidence of being offensive over all the models, along with the uncertainty of the prediction given by their standard deviation (σ).
The main challenge of using this dataset is how to make effective use of such a large resource, as the tweets with mid-range average confidence values (µ) often have high variance (σ) and, as a result, are imprecise and difficult to interpret. To overcome this challenge, we created two sub-sampled datasets (A and B) by removing the majority of records with mid-range µ (uncertain predictions) and high variance σ from both sets, and by sampling evenly from positive and negative values in the second set (set B).
We also incorporated additional datasets into our model to balance out the uncertainty of the semi-supervised set. Rather than creating one single combined dataset, we trained separate models for each dataset and experimented with various ensemble techniques. This allowed more flexibility in tuning the impact of each dataset and let us benefit from the larger datasets without undermining the smaller ones.
We fine-tuned a BERT model for each dataset. We chose this model architecture as it has been shown to outperform models built using other structures for offensive language tasks (Nikolov and Radivchev, 2019).

Related Work

Last year's edition, OffensEval 2019, proposed a three-level hierarchical schema for offensive language classification (Zampieri et al., 2019a; Zampieri et al., 2019b). This dataset considers (i) whether a tweet is offensive or not, (ii) whether the offense is targeted, and (iii) whether it is targeted towards individuals, groups or others. The associated tasks were divided into three corresponding sub-tasks. For the first sub-task, offensive language classification, models with the BERT (Devlin et al., 2019) architecture consistently out-performed other methods and were used by seven of the top ten teams. We followed a similar method to the top teams, such as Nikolov and Radivchev (2019), using their recommended steps for pre-processing the tweets and fine-tuning a pre-trained BERT model.
In a related Kaggle competition in 2019, Jigsaw published a dataset for identifying toxicity and minimising bias in online comments (Jigsaw, 2019). The winning model was a blend of 2x XLNet, 2x BERT and GPT-2 medium (Prokoptsev et al., 2019). We did not follow this approach due to time and resource constraints and instead opted to use the dataset to supplement our existing data.
Waseem et al. (2017) proposed a separate two-fold typology for synthesizing different subtasks in abusive language detection by considering (i) whether the abuse is directed at a specific target and (ii) the degree to which it is explicit. They noted that implicit abuse is more difficult to identify, sometimes requiring more detailed annotation guidelines and perhaps even expert annotators.

OffensEval proposed system
We trained several models using different datasets and combined the best ones in an ensemble. In this section we describe the datasets that were considered and how the models were trained. Table 1 outlines the target distribution of each of the datasets.

Semi-Supervised Dataset for Offensive Language Identification (SOLID)
The official OffensEval 2020 English training dataset (SOLID) contains over nine million tweets and was labelled in a semi-supervised manner using models built from an ensemble of PMI (Turney and Littman, 2003), LSTM (Hochreiter and Schmidhuber, 1997), FastText (Joulin et al., 2016) and BERT (Devlin et al., 2019). It followed the same annotation guidelines as OLID (Zampieri et al., 2019a), where a tweet is labelled as offensive (OFF) if it contains any form of profanity or targeted offense, and non-offensive (NON) otherwise. However, instead of a binary label, two numerical scores are provided for each tweet: µ and σ. µ represents the average of the confidences predicted by the models for belonging to the positive class (OFF), and σ is the standard deviation of those confidences.
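Concretely, if c_1, ..., c_N denote the positive-class confidences assigned to a tweet by the N component models, the two scores correspond to their mean and standard deviation (whether SOLID normalises by N or N-1 is not stated, so the population form below is an assumption on our part):

\mu = \frac{1}{N}\sum_{i=1}^{N} c_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(c_i - \mu\right)^2}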
In order to find a suitable decision boundary, we binned µ and σ and analysed their respective distributions using histograms of the confidence levels (Figure 1). Figure 1 shows that the dataset has significantly more negative examples (NON) than positive (OFF), and that mid-range values of µ have a higher standard deviation σ.
We inspected the tweets with mid-range values of µ (0.4 < µ < 0.6) and high standard deviation σ (> 0.3) and found various examples of misclassification, assuming that the classification threshold was set at µ = 0.5. The most prominent was that many tweets containing profanity had µ < 0.5, as shown in Table 4 of the Appendix. Additionally, self-deprecating comments were often mislabelled as offensive, and general negative comments on society or the environment were also incorrectly marked as offensive (see Table 5 of the Appendix). As tweets with mid-range µ and/or high σ appear to be misleading, we hypothesised that training on datasets without these uncertain cases would improve the results. We created two subsets by under-sampling the mid-range values of µ (0.4 < µ < 0.6) to see if more selective sampling would improve the results: one containing about 6M samples obtained by under-sampling only the uncertain region (SOLID A) and another with only 1M samples (SOLID B).
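For illustration, the uncertain region can be isolated with a few lines of pandas. This is a minimal sketch, assuming the distant-supervision scores are in a tab-separated file with columns named average and std; the file name and column names are our assumptions, not the released headers.

import pandas as pd

# Load the SOLID distant-supervision scores (file and column names are assumed).
solid = pd.read_csv("task_a_distant.tsv", sep="\t")  # columns: id, text, average, std

# Tweets in the mid-range / high-variance region discussed above.
uncertain = solid[solid["average"].between(0.4, 0.6) & (solid["std"] > 0.3)]
print(len(uncertain), "tweets with 0.4 < mu < 0.6 and sigma > 0.3")

# Binned view of mu, analogous to the histograms in Figure 1.
print(pd.cut(solid["average"], bins=10).value_counts().sort_index())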
SOLID A was created by removing the majority of records with mid-range µ, as well as all records with σ > 0.2. The distribution is shown in the middle column of Figure 1. The resulting dataset contains 6,464,288 records, roughly 72% of the original dataset volume.
SOLID B was created with the intention of balancing the dataset by sampling an equal number of records from the high and low µ ranges. Records with σ > 0.2 were removed and only a small number of tweets with mid-range µ were kept. The distribution is shown in the right column of Figure 1. The resulting dataset contains 1,030,000 records, roughly 11% of the original dataset volume.
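A hedged sketch of the two sub-sampling schemes is given below. The paper only fixes the resulting sizes, so the sampling fraction and per-class counts used here are illustrative placeholders, and the small mid-range sample kept in SOLID B is omitted for brevity.

import pandas as pd

def make_solid_a(solid, seed=42):
    """SOLID A: drop sigma > 0.2 and most of the mid-range mu records."""
    confident = solid[solid["std"] <= 0.2]
    mid = confident["average"].between(0.4, 0.6)
    # Keep only a small share of the uncertain region (fraction is a placeholder).
    kept_mid = confident[mid].sample(frac=0.1, random_state=seed)
    return pd.concat([confident[~mid], kept_mid])

def make_solid_b(solid, n_per_class=500_000, seed=42):
    """SOLID B: balance low- and high-mu records after removing sigma > 0.2."""
    confident = solid[solid["std"] <= 0.2]
    low = confident[confident["average"] < 0.4].sample(n=n_per_class, random_state=seed)
    high = confident[confident["average"] > 0.6].sample(n=n_per_class, random_state=seed)
    return pd.concat([low, high])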

Offensive Language Identification Dataset (OLID)
The Offensive Language Identification Dataset (OLID) 2 was created for the OffensEval 2019 shared task. The training set contains 14,100 manually annotated tweets, where a tweet was labelled as offensive (OFF) if it contains any form of profanity or targeted offense, either veiled or direct, and non-offensive (NON) otherwise. The ratio of OFF to NON is roughly 1 to 2. This dataset is of higher quality and more reliable than SOLID; however, it is almost 650 times smaller.

Jigsaw Unintended Bias in Toxicity Classification Dataset (Kaggle)
The Kaggle 2019 Toxicity Classification dataset (Kaggle) 3 contains over 1.8 million public comments from online news discussions. This dataset was created with the aim of reducing unintended bias in toxicity classification as a result of identity mentions. The data has been labelled with identity mentions, such as Muslim, Gay or Black, and a toxicity score (TARGET) that represents the fraction of human annotators who believe the post is toxic. We decided to include it as it is a large dataset where each comment has been reviewed by up to 10 annotators, and its content could prove useful in reducing false positive errors.

Profanity Check (Profanity)
A simple text search of the word "fuck" in SOLID returned 268,845 matches, roughly 3% of all tweets. Out of these, 2.3% were misclassified as non-offensive. Figure 3 of the Appendix compares the distribution of the confidence scores µ and σ for profanity tweets containing the word "fuck" against those in the whole dataset. Most of the uncertainty in the predictions for profanity tweets coincided with these misclassifications (Figure 3, top left). This suggests a need to increase our model's sensitivity to profanity words. Rather than using a dictionary-based approach (Han et al., 2019), we decided to use a Python library, Profanity-check 4 , that checks for profanity and offensive language in text. It was built using an SVM model trained on 184,354 records from a Twitter dataset 5 and a Wikipedia dataset (Zhou, 2019) 6 .
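As an illustration of how this library is called (the example strings are ours and the outputs in the comments are indicative only):

from profanity_check import predict, predict_prob

tweets = ["have a great day", "well fuck this"]
print(predict(tweets))       # binary offensive flags, e.g. [0 1]
print(predict_prob(tweets))  # probability estimates from the calibrated linear SVM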

Ensemble Model
We used HuggingFace's implementation of the pre-trained BERT base-uncased model (Wolf et al., 2019) as the basis for each model and fine-tuned it on each dataset independently. The learning rate was set at lr = 2e-5 with the Adam optimizer, 5% of the samples were used for warm-up with a linear schedule, and the batch size was 32. The classifier threshold was set to 0.5 for all models. Starting from the pre-trained model, we fine-tuned separate models on the datasets described above (SOLID, OLID, Kaggle, Profanity) for 2-3 epochs. We then combined all four models into an ensemble trained with different approaches: a simple average, gradient boosting (Mason et al., 1999), AdaBoost (Freund and Schapire, 1995), SVMs (Vapnik, 1995) with a linear and a Radial Basis Function (RBF) kernel, and logistic regression. Table 9 in the Appendix shows the hyper-parameter values tuned for each model.
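A minimal fine-tuning sketch with the transformers library is shown below. It reproduces the stated hyper-parameters (lr = 2e-5, 5% linear warm-up, batch size 32, 2-3 epochs), but the 128-token limit, data handling and helper names are our assumptions rather than the submitted code.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

def fine_tune(texts, labels, epochs=3, batch_size=32, lr=2e-5, device="cuda"):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)

    # Tokenise the tweets and wrap them in a simple DataLoader.
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
        batch_size=batch_size, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = len(loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),  # 5% warm-up
        num_training_steps=total_steps)

    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            optimizer.zero_grad()
            out = model(input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=y.to(device))
            out.loss.backward()
            optimizer.step()
            scheduler.step()
    return tokenizer, model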

[Table 2: per-model results (macro-F1, precision (P), recall (R), accuracy (acc.)) on the dev set and the test set.]

Experimental Results
We created train/dev splits with the sizes specified in Table 8 for hyper-parameter tuning when learning each model individually. We further used the OLID test set as the validation set for model evaluation and selection. We report the official metric for this task, macro-F1, which gives equal importance to precision and recall as well as equal weighting to the minority and majority classes. Due to the imbalanced data, the performance of the models on the minority class was particularly important. Table 2 details the classification results of each model independently. We report the validation scores (OLID Test) and the test scores (SOLID Test). In addition to macro-F1, we also include macro precision (P), recall (R) and accuracy (acc.). The OLID and SOLID models show better recall, while the Kaggle and Profanity models have better precision. This is also reflected in the confusion matrices in Figure 2, where OLID and SOLID have significantly fewer false negatives while Kaggle and Profanity have far fewer false positives.
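For reference, the reported metrics correspond to the following scikit-learn calls, where y_true and y_pred stand for the gold labels and a model's thresholded predictions:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report(y_true, y_pred):
    """Macro-averaged F1, precision and recall, plus accuracy, as in Table 2."""
    return {
        "macro-F1": f1_score(y_true, y_pred, average="macro"),
        "P": precision_score(y_true, y_pred, average="macro"),
        "R": recall_score(y_true, y_pred, average="macro"),
        "acc.": accuracy_score(y_true, y_pred),
    }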
Contrary to our initial hypothesis, the models trained by under-sampling the uncertain partition of SOLID did not perform as well as the model built using the whole set. We posit that further tuning of the µ and σ thresholds used for under-sampling may influence the results; applying more aggressive under-sampling could possibly increase precision scores. Due to time constraints, we leave further exploration as future work. Seeing that each model has different strengths, we further experimented with various ensemble techniques to combine the four models: OLID, SOLID, Kaggle and Profanity.
We report the best model on the dev set (OLID Test). The weighted average ensemble (Avg.) used equal weights for each of the four component models. For the grid-search ensemble (Grid), we searched weights from 0 to 1 in steps of 0.1. This resulted in over 1000 combinations with the same best result, possibly over-fitting due to the small dev set size (860 examples). For the remaining ensembles, we used 10-fold cross-validation combined with grid search over the hyper-parameters to find the best parameters for each ensemble. Table 9 in the Appendix summarizes the hyper-parameters we explored.
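The two simplest ensembles can be sketched as follows; probs is assumed to be an (n_examples, 4) array of positive-class probabilities from the OLID, SOLID, Kaggle and Profanity models on the dev set, and the function and variable names are ours.

import itertools
import numpy as np
from sklearn.metrics import f1_score

def average_ensemble(probs, weights=None, threshold=0.5):
    """Weighted average of component probabilities, thresholded at 0.5."""
    weights = np.ones(probs.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    blended = probs @ (weights / weights.sum())
    return (blended >= threshold).astype(int)

def grid_search_weights(probs, y_true, num_steps=11):
    """Search weights from 0 to 1 in steps of 0.1, maximising macro-F1 on the dev set."""
    grid = np.linspace(0.0, 1.0, num_steps)
    best_f1, best_w = -1.0, None
    for w in itertools.product(grid, repeat=probs.shape[1]):
        if sum(w) == 0:
            continue  # skip the all-zero weight vector
        f1 = f1_score(y_true, average_ensemble(probs, w), average="macro")
        if f1 > best_f1:
            best_f1, best_w = f1, w
    return best_w, best_f1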
We report the model ensemble results in Table 3. We submitted the Avg. ensemble, since this was the only one we had experimented with at the time of submission. This model was also the best ensemble on the test set.

Table 3: Results of the ensembles, reporting macro-F1, precision (P), recall (R) and accuracy (acc.) for the dev and test sets. Bold values show the best performing models.

Conclusions and Future Work
There are offensive language classification datasets that reflect different annotation guidelines and different annotators' subjectivity. In this work we take advantage of the heterogeneous nature of these datasets to improve the performance of offensive language identification by combining several models in an ensemble. This showed the importance of having training data that is reliable and diverse enough to capture different types of scenarios, and the potential benefit of combining and consolidating those datasets. We hypothesised that by under-sampling uncertain and possibly mis-classified tweets we could improve the performance of classification algorithms, but the results so far have been inconclusive. We leave as future work more aggressive under-sampling schemes, to assess the consequences of training only on highly confident predictions, and additional semi-supervised strategies to improve results using the SOLID dataset. We show that using different sources for training offensive language classification models helps improve the quality of the predictions.

B Distribution of examples containing the word f**k

Figure 3 compares the distribution of the confidence scores (µ) and (σ) for profanity tweets containing the word "fuck" against those in the whole dataset. The box-plot on the top left shows that tweets with µ between 0.3 and 0.4 have a larger σ, indicating more uncertainty in the classification. The SOLID AVG CONF (µ) histogram in the middle left also shows that some tweets with the word "f**k" had µ < 0.5.

Figure 3: Distribution of f**k words in SOLID