KAFK at SemEval-2020 Task 12: Checkpoint Ensemble of Transformers for Hate Speech Classification

This paper presents the approach of Team KAFK for the English edition of SemEval-2020 Task 12. We use checkpoint ensembling to create ensembles of BERT-based transformers and show that it can improve the performance of classification systems. We explore attention mask dropout to compensate for the poorly constructed text typical of social media. Our classifiers achieved macro-F1 scores of 0.909, 0.551 and 0.616 for subtasks A, B and C respectively. The code is publicly available online.


Introduction
The research community has put much effort over the last few years into developing automated detection methods to combat hate speech on social media (Schmidt and Wiegand, 2017). However, the complex nature of this phenomenon means that it has no single solution. The second edition of the OffensEval workshop, titled OffensEval-2020, is organized with the goal of promoting further research in this domain.
The OffensEval-2020 workshop features shared subtasks in five different languages. We participated in the English track, which consists of three subtasks (A, B and C). Each subtask corresponds to one level of the taxonomy of offensive content. Subtask A (Offensive language identification) is the classification of a post as offensive [OFF] or not offensive [NOT]. Subtask B (Automatic categorization of offense types) determines whether a post contains an insult or a threat targeted towards an individual, a group, or other [TIN], or simply contains non-targeted profanity and swearing [UNT]. Subtask C (Offense target identification) is the identification of the target of an offensive post: individual [IND], group [GRP] or other [OTH].

Data
The dataset for the English edition of SemEval-2020 Task 12 is compiled by . It is a large-scale dataset of loosely labelled text samples, i.e., each sample is associated with an average confidence measure and its standard deviation. The average confidence is the mean of the confidences with which several supervised models predicted an instance as belonging to the positive class for a subtask. The positive class is OFF for subtask A and UNT for subtask B. The dataset of subtask C has no positive class; instead, the average confidence of each class is given. In our study, we assign an instance to a class if its average confidence for that class is greater than or equal to 0.50. The label counts for each subtask are given in Figure 1.
Model ensembling refers to the method of constructing a single classifier from a collection of different classifiers (Dietterich, 2000). However, creating ensembles of large Deep Neural Networks (DNNs), like the ones used in this study, is expensive. It is generally not possible to train multiple models with limited time and GPU resources. Hence, based on the work by Chen et al. (2017), we used a simple Checkpoint Ensembling method to create the transformer-based ensemble classifiers used in this study.
In Checkpoint Ensembling, a copy of the model is saved at each checkpoint. These copies are later combined to make the classification (see Figure 2a). In contrast, a traditional ensemble usually combines different models (see Figure 2b). In our checkpoint ensembling approach, we save the dev set predictions and weights of the DNN models at each epoch and apply Algorithm 1 to determine which models to use for the ensemble. Algorithm 1 uses the dev set predictions to build the list of models in the ensemble: it picks a model only if adding its predictions improves the metric (macro-F1 for OffensEval). Algorithm 1 is called twice, with reverse set to True and then to False. If the ensemble does not improve the metric, we simply choose the best model found during training. After determining the models, we apply Algorithm 2 to get the final predictions. Algorithm 2 adds up the outputs of the classification layer of the chosen models and takes the argmax along each row to get the final prediction.
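The selection and combination steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released code: the function names (`macro_f1`, `ensemble_predict`, `select_checkpoints`) are hypothetical, the `reverse` flag is assumed to flip the order in which checkpoints are visited, and a checkpoint is assumed to be kept only when it strictly improves the dev macro-F1.

```python
from typing import List, Sequence, Tuple

Scores = List[List[float]]  # one row of class scores per dev sample


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of the per-class F1 scores."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / n_classes


def ensemble_predict(outputs: Sequence[Scores]) -> List[int]:
    """Algorithm 2 as described: sum the model outputs, then row-wise argmax."""
    n_rows, n_cols = len(outputs[0]), len(outputs[0][0])
    summed = [[0.0] * n_cols for _ in range(n_rows)]
    for out in outputs:
        for i, row in enumerate(out):
            for j, value in enumerate(row):
                summed[i][j] += value
    # Ties are broken in favour of the lowest class index.
    return [max(range(n_cols), key=row.__getitem__) for row in summed]


def select_checkpoints(
    outputs: Sequence[Scores],
    y_true: Sequence[int],
    n_classes: int,
    reverse: bool = False,
) -> Tuple[List[int], float]:
    """Greedy selection in the spirit of Algorithm 1 (assumed details)."""
    order = list(range(len(outputs)))
    if reverse:
        order.reverse()  # assumption: reverse flips the visiting order
    chosen: List[int] = []
    best = -1.0
    for idx in order:
        trial = chosen + [idx]
        preds = ensemble_predict([outputs[i] for i in trial])
        score = macro_f1(y_true, preds, n_classes)
        if score > best:  # keep the checkpoint only if the metric improves
            chosen, best = trial, score
    return chosen, best
```

Calling `select_checkpoints` once with `reverse=True` and once with `reverse=False`, and keeping the higher-scoring list, mirrors the two calls described above. How the two results are compared is likewise an assumption of this sketch.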

Classifiers
This section describes the classifiers built for each of the subtasks of OffensEval-2020. All of the classifiers described below follow the basic transfer learning procedure shown in Figure 3. The classifiers and their training routines are written in PyTorch (Paszke et al., 2019). The data splits are made such that the percentage of samples for each class is the same in each split. The random seed is set to 42 wherever applicable. The code has been made public for reference.
Algorithm 2, final steps: E ← E + p for the predictions p of each chosen model; preds ← index of the maximum element in each row of E.

Subtask A: Ensembled DistilGPT-2
GPT-2 is a large transformer-based (Vaswani et al., 2017) language model developed by Radford et al. (2019). It is trained on 40 GB of internet text and has many capabilities, chief among them the generation of synthetic text. The generated text is of such high quality that it can easily be mistaken for human-written text.
Because of GPT-2's extremely large size (1.5 billion parameters), it requires a lot of time and resources to train. We therefore used DistilGPT-2 instead. Distil* is a class of compressed transformer-based models that offer faster training and inference while being smaller in size. These models are meant to enable the use of large high-performance models in a production environment. The authors of Distil* show that it retains up to 97% of the performance of the original models. Distil* also enabled us to work on the massive datasets of OffensEval-2020 with high-performance base models using modest computing resources. This model is used to build the classifier for subtask A.
We used 70% of the data for training and 13,633 samples for validation. We used only a small subset for validation because we did not have enough time to obtain inferences on the entire remaining 30% (≈ 2.7 × 10^6 samples). To train the classifier, we first converted each text sample into a sentence matrix by extracting 768-dimensional word embeddings from pre-trained DistilGPT-2. These were then fed into a dense layer with c units, where c is the number of classes, which performs the classification. We fine-tuned the entire model using a cross-entropy loss function with a small learning rate of 1e-4 for 5 epochs and applied Checkpoint Ensembling as described in Subsection 4.1. To optimize the model, we used Ranger, which combines two optimizers: RAdam (Liu et al., 2019a) wrapped with LookAhead (Zhang et al., 2019). We set the (k, α) parameters of the optimizer to (5, 0.5). The batch size and maximum sequence length were set to 399 and 64 respectively. Checkpoint ensembling improved the dev set macro-F1 score from 0.9656 to 0.9663.
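To illustrate what the (k, α) parameters control, the following toy sketch shows the LookAhead update rule on a list of scalar parameters. It is not the Ranger implementation used in this work: the class name `ToyLookahead` is hypothetical, and plain gradient descent stands in for RAdam as the inner optimizer.

```python
from typing import List


class ToyLookahead:
    """LookAhead wrapper around plain gradient descent (stand-in for RAdam)."""

    def __init__(self, params: List[float], k: int = 5, alpha: float = 0.5):
        self.fast = list(params)  # "fast" weights, updated at every step
        self.slow = list(params)  # "slow" weights, updated every k steps
        self.k = k
        self.alpha = alpha
        self.steps = 0

    def step(self, grads: List[float], lr: float) -> None:
        # Inner optimizer step on the fast weights.
        for i, g in enumerate(grads):
            self.fast[i] -= lr * g
        self.steps += 1
        # Every k steps, move the slow weights a fraction alpha of the way
        # towards the fast weights, then reset the fast weights to them.
        if self.steps % self.k == 0:
            for i in range(len(self.fast)):
                self.slow[i] += self.alpha * (self.fast[i] - self.slow[i])
                self.fast[i] = self.slow[i]
```

With k = 5 and α = 0.5, the slow weights are refreshed once every five inner steps and move halfway towards the fast weights each time.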

Subtask B: Ensembled RoBERTa
Liu et al. (2019b) identified the shortcomings of BERT (Devlin et al., 2019) and introduced RoBERTa, a robustly optimized version of BERT. We used its pre-trained version for subtask B. We coupled Checkpoint Ensembling with the pre-trained RoBERTa Sequence Classifier by . The classifier was trained for 20 epochs with early stopping patience set to 4. We used 85% of the data as the train set and the rest as the dev set. The maximum sequence length was set to 245 and the batch size to 128. The other parameters and hyper-parameters were kept the same as for subtask A. Checkpoint ensembling improved macro-F1 on the dev set from 0.8881 to 0.8907.
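The early stopping rule mentioned above can be sketched as follows. This is an illustration only: the function name `stopping_epoch` is hypothetical, the monitored quantity is assumed to be the dev set macro-F1, and only a strict improvement is assumed to reset the patience counter.

```python
from typing import Sequence


def stopping_epoch(dev_scores: Sequence[float], patience: int = 4) -> int:
    """Return the 0-based epoch at which training would stop."""
    best = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(dev_scores):
        if score > best:
            # New best dev score: remember it and reset the patience window.
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs.
            return epoch
    return len(dev_scores) - 1  # ran through every available epoch
```

Because predictions and weights are saved at every epoch, the checkpoints produced before the stopping point remain available for checkpoint ensembling.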

Subtask C: Ensembled DistilRoBERTa with Attention Mask Dropout
This classifier was built using the pre-trained distilled version of the RoBERTa Sequence Classifier by , coupled with a slightly modified Checkpoint Ensembling. Instead of directly using the dev set predictions, we first sorted them in decreasing order of their macro-F1 scores. We also applied dropout to the attention masks. Attention masks specify which tokens of the sentence the model should attend to. With probability p, we randomly dropped d% of the attention masks of the tokens in the sentence. We set p to 0.50 and d to 30%. We used attention mask dropout in an attempt to mitigate the poor grammar often encountered in social media texts. Like the original dropout technique (Hinton et al., 2012), it was used only during training. The maximum sequence length was set to 245. The pre-trained model was fine-tuned using a cross-entropy loss function for 20 epochs with early stopping patience set to 4. We used a batch size of 120 and a learning rate of 1e-4 with the Ranger optimizer. As in the previous classifiers, we set the (k, α) parameters of the optimizer to (5, 0.5). Here 95% of the labelled data was used for training and the rest as the dev set. Ensembling improved the dev set macro-F1 from 0.8148 to 0.8281.
The dev and test set results for each of the classifiers are given in Table 1. The test set contained 3887, 1422 and 850 text samples for subtasks A, B and C respectively. Ensembled DistilGPT-2 crossed the 0.90 mark on the test set (macro-F1 of 0.909). Using the full dataset instead of just 70% might have resulted in a better score. This shows the need for techniques like model quantization to deal with massive datasets. Ensembled RoBERTa and Ensembled DistilRoBERTa performed quite poorly. They clearly overfit the data, as the gap between the dev set and test set macro-F1 scores is large (0.8907 vs. 0.551 for subtask B and 0.8281 vs. 0.616 for subtask C). Later experimentation revealed that attention mask dropout in subtask C hurt the performance of the model. Without it, the dev set macro-F1 was 0.8275 and sorted checkpoint ensembling improved the score to 0.8336.
Ensembled DistilGPT-2 for Subtask A correctly classified every offensive sample except two, which were predicted as not offensive:
• "@USER @USER Dehumanize? He barely has a reflection of human"
• "@USER @USER 'Respect the result' is a mendacious soundbite trotted out by charlatans daily"
This was perhaps due to the complex use of language, the use of rare words such as "mendacious" and "trotted", and the absence of typical profane words. As seen from the confusion matrix in Figure 4a, this classifier had a bias towards the offensive class.
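The attention mask dropout used for Subtask C can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the function name `attention_mask_dropout` is hypothetical, dropping is assumed to apply only to positions that are currently attended to (mask value 1, so padding is left untouched), the number of dropped positions is assumed to be rounded down, and special tokens are not treated differently.

```python
import random
from typing import List, Optional


def attention_mask_dropout(
    mask: List[int],
    p: float = 0.5,
    d: float = 0.30,
    training: bool = True,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """With probability p, zero out a fraction d of the attended positions."""
    if not training:
        return list(mask)  # like standard dropout, disabled at inference time
    rng = rng if rng is not None else random.Random()
    if rng.random() >= p:
        return list(mask)  # this sentence is left unchanged
    attended = [i for i, m in enumerate(mask) if m == 1]
    n_drop = int(len(attended) * d)  # assumption: round down
    dropped = set(rng.sample(attended, n_drop))
    return [0 if i in dropped else m for i, m in enumerate(mask)]
```

For a sentence with 10 real tokens and d = 0.30, three tokens are hidden from the model whenever the dropout is triggered, which with p = 0.50 happens for about half of the training sentences.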

Results and Error Analysis
Ensembled RoBERTa for Subtask B performed poorly. It failed to distinguish properly between targeted and untargeted samples. Untargeted samples such as "@USER Yeah my deck was insane that run, but normally I suck lol" were predicted as targeted. We found that samples containing the "@USER" token were mostly mistaken as targeted. The confusion matrix is given in Figure 4b.
Ensembled DistilRoberta for Subtask C misclassified most of the samples of 'other' class as targeting individuals, as apparent from its confusion matrix in Figure 4c. The same effect can be seen in the group class. Also, we found that this classifier made many mistakes where emojis are used. For example • " I'm the only to get the job done ..... ion kno a nigga dat can cover for me" is misclassified as belonging to the individual class rather than group targeted. In this example, the ' ' emoji is being used to denote "one". The model was unable to get the context from the emojis.

Discussion and Conclusion
In this work, we built three different transformer-based classifiers for the three subtasks of OffensEval 2020. We created ensembles using Checkpoint Ensembling. Although checkpoint ensembling improved the performance of the classifiers, the improvements were quite small (between 0.0007 and 0.0133 macro-F1 on the dev sets). However, considering how cheap and simple the method is, it should not be dismissed. We found that attention mask dropout did not work as expected. We suspect that more tuning of the p and d hyper-parameters might have been necessary to get it to work properly. In future work, we would like to explore and evaluate more sophisticated ensembling methods such as the Meta Classifier by Malmasi and Zampieri (2018).