UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information

Pre-trained language model word representations, such as those produced by BERT, have been extremely successful in several Natural Language Processing tasks, significantly improving on the state of the art. This can largely be attributed to their ability to better capture the semantic information contained within a sentence. Several tasks, however, can benefit from information available at the corpus level, such as Term Frequency-Inverse Document Frequency (TF-IDF). In this work we test the effectiveness of integrating this information with BERT on the task of identifying abuse on social media, and show that doing so does indeed significantly improve performance. We participate in Sub-Task A (abuse detection), in which we achieve a score within two points of the top performing team, and in Sub-Task B (target detection), in which we are ranked 4th of the 44 participating teams.


Introduction
Offensive language is pervasive in social media. Because it is so common, and because it is often used for emphasis rather than for its literal meaning, truly offensive content can be hard to identify. Individuals frequently take advantage of the perceived anonymity of computer-mediated communication to engage in behaviour that many of them would not consider in real life. Online communities, social media platforms, and technology companies have been investing heavily in ways to cope with offensive language and to prevent abusive behaviour on social media. One of the most effective strategies for tackling this problem is to use computational methods to identify offense, aggression, and hate speech in user-generated content (e.g. posts, comments, microblogs).
The SemEval 2020 task on abuse detection (Zampieri et al., 2020) aims to study both the target and the type of offensive language, aspects not covered by previous work on related offenses such as hate speech detection and cyberbullying. In this paper, we focus on the first two sub-tasks: Sub-task A, offensive language or profanity detection, a binary classification problem in which the objective is to determine whether a tweet is offensive; and Sub-task B, target identification, also a binary classification problem, in which we are required to determine whether a tweet is targeted at someone or something.

Boosting Pre-trained Representations
Recent Natural Language Processing (NLP) systems have focused on the use of deep learning methods that take word embeddings as input. While these methods have been extremely successful on several tasks, we believe that information pertaining to the importance of individual words, available at the corpus level, might not be effectively captured by models that use pre-trained embeddings, especially given the small number of training epochs (usually 3) used. We hypothesise that deep learning models, especially those that use pre-trained embeddings and are therefore trained for a small number of epochs, can benefit from corpus-level count information. We test this on Sub-Task A using an ensemble of BERT and TF-IDF, which outperforms both of the individual models (Section 5.1).
For Sub-Task B, we hypothesise that these sentence representations can benefit from part-of-speech (POS) information to help identify the presence of a target. To test this hypothesis, we integrate counts of POS tags with BERT. While this combination did outperform BERT, we found that a simpler modification to BERT (i.e. cost weighting, Section 3.5) outperforms it.

Related Work
The Offensive Language Identification Dataset (OLID) was created by Zampieri et al. (2019) for the 2019 edition of this task, as there was no prior dataset designed for it. OLID consists of 14,100 tweets labelled by experienced annotators, but it suffers from its limited size and, in particular, from class imbalance. To address this, OffensEval 2020 made use of the Semi-Supervised Offensive Language Identification Dataset (SOLID) (Rosenthal et al., 2020).

Prior OffensEval Systems
Based on the results of OffensEval 2019, BERT is itself very powerful and performs relatively well on all three sub-tasks. In this section, we examine the techniques used by some of the best performing systems, which we draw on for our own methods.
Nikolov and Radivchev (2019) used a large variety of models and combined the best of them in ensembles. They pre-processed the tweets by splitting hashtagged tokens into separate words on camel case. Stop words were not filtered for the second and third sub-tasks, because certain nouns and pronouns could contain useful information for the models to detect targets. Due to the class imbalance in the second and third sub-tasks, they used a variety of techniques to deal with it: oversampling by duplicating examples from under-represented classes; adjusting class weights to give more weight to under-represented classes; and shifting the classification threshold away from the default 0.5 for binary classes to accommodate the imbalance. The ensemble models were found to over-fit the training data, while BERT generalised best. Their BERT submissions achieved 2nd place on the first sub-task and 1st place on the last.
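The threshold-shifting technique can be illustrated with a minimal sketch; the 0.3 boundary below is purely illustrative, not a value reported by Nikolov and Radivchev:

```python
def shifted_threshold_classify(p_minority, threshold=0.3):
    """Label an example as the minority class when its predicted probability
    exceeds a boundary below the default 0.5, compensating for class
    imbalance. The 0.3 threshold here is illustrative only."""
    return 1 if p_minority >= threshold else 0

# A borderline prediction that the default 0.5 boundary would miss:
label = shifted_threshold_classify(0.4)  # classified as the minority class
```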
Similarly, another top system mostly focused on pre-processing inputs before feeding them to BERT; pre-processing appears to work very well in terms of improving BERT's results. In addition to hashtag segmentation, their techniques included emoji substitution, using a Python library that converts emoji Unicode into phrases to increase the semantic content of tweets. With pre-processing alone, they achieved 1st place on the first sub-task.
A significantly different method was used by Han et al. (2019), who evaluated tweets with a rule-based sentence-offensiveness calculation. Tweets with high or low offensiveness values are automatically classified as offensive or non-offensive respectively; otherwise, classification follows a probabilistic distribution. On Sub-Task B, this sentence-offensiveness model outperformed other systems that used deep learning or non-neural machine learning. This is an interesting finding, as it shows that traditional rule-based models for target classification can be very successful compared to deep learning methods.

System Overview
For Sub-Task A, we test three models: a standard neural network that uses TF-IDF features, BERT, and an ensemble of the two. For Sub-Task B, we use noun counts, BERT, and an ensemble of both.

TF-IDF
To incorporate corpus-level information into our model, we use TF-IDF. TF-IDF allows us to identify keywords that help distinguish offensive from non-offensive tweets: offensive tweets tend to contain more offensive words, while non-offensive tweets usually contain more neutral-toned words. Since the TF-IDF features are to be combined with BERT, we feed them into a neural network, so that, unlike with non-neural machine learning techniques, the TF-IDF component can continue to learn when the combined model is trained.
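As a rough illustration of the weighting involved (our actual system uses NLTK's tweet tokenizer, stemming, and a 6,000-word vocabulary; this plain-Python sketch omits those details), TF-IDF scores a term highly when it is frequent in a document but rare across the corpus:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a small corpus of tokenised documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

docs = [
    "you are an idiot".split(),   # toy "offensive" tweet
    "have a nice day".split(),    # toy "non-offensive" tweet
    "what a nice idea".split(),
]
vecs = tfidf_vectors(docs)
```

A term such as "idiot", which appears in only one document, receives a higher weight than "nice", which appears in two.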

BERT
We pick BERT (Devlin et al., 2018) because it has outperformed several similar techniques that provide sentence-level embeddings, such as BiLSTMs and ELMo (Peters et al., 2018). It was also shown to be very effective on all of the sub-tasks in the previous year's evaluation (Zampieri et al., 2019). BERT is strong both in generalisation and in handling context-dependent evaluation.

Ensemble Model
Ensemble techniques have been shown to be effective in reducing the variance of predictions and in making better predictions, which for neural networks can be achieved by combining multiple sources of information (Brownlee, 2018). We use an ensemble model to combine the individual models into one. BERT alone provides sentence-level information, but by combining BERT features with TF-IDF features we gain access to both sentence- and corpus-level information, which is the goal of our hypothesis. The ensemble is created by concatenating BERT's sentence representation with the features generated by the TF-IDF model and using the combined vector for classification. In practice, this translates into calculating the TF-IDF vector for each sentence and concatenating it with the corresponding BERT output. This vector is then fed to a fully connected classification layer. Both BERT and the TF-IDF weights are updated during training.
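The concatenate-then-classify step can be sketched as follows in PyTorch; the dimensions are illustrative (BERT-base's 768 and a 6,000-term TF-IDF vocabulary), and the full pipeline producing the BERT [CLS] vector and TF-IDF features is omitted:

```python
import torch
import torch.nn as nn

class BertTfidfEnsemble(nn.Module):
    """Sketch of the ensemble head: concatenate a sentence-level BERT
    vector with a TF-IDF feature vector, then classify with a single
    fully connected layer."""

    def __init__(self, bert_dim=768, tfidf_dim=6000, n_classes=2):
        super().__init__()
        self.classifier = nn.Linear(bert_dim + tfidf_dim, n_classes)

    def forward(self, bert_cls, tfidf_feats):
        # bert_cls:    (batch, bert_dim), e.g. BERT's [CLS] representation
        # tfidf_feats: (batch, tfidf_dim), per-sentence TF-IDF vector
        combined = torch.cat([bert_cls, tfidf_feats], dim=-1)
        return self.classifier(combined)

model = BertTfidfEnsemble()
logits = model(torch.zeros(4, 768), torch.zeros(4, 6000))
```

Because the concatenated vector flows through a single differentiable layer, gradients reach both the BERT parameters and the TF-IDF feature weights during training.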

Noun Count As Features
We have seen the success of the rule-based method for Sub-Task B, which achieved significant performance gains compared to machine learning techniques: Han et al. (2019) showed that a manually annotated list of offensive words providing a measure of offensiveness strength is effective. Since targets are very likely to appear as nouns or pronouns in tweets, we can identify the presence of a target from counts of part-of-speech tags such as 'PRP' and 'NNS'.
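The feature extraction can be sketched as below; in practice the (token, tag) pairs would come from a POS tagger such as NLTK's, and the tag names follow the Penn Treebank convention ('PRP' for personal pronouns, 'NNS' for plural nouns). The example tweet is hypothetical:

```python
from collections import Counter

def target_pos_counts(tagged_tokens, tags=("NNS", "PRP")):
    """Count occurrences of the POS tags used as target-presence features.
    `tagged_tokens` is a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[t] for t in tags]

# Hypothetical tagged tweet: "they are liars"
tagged = [("they", "PRP"), ("are", "VBP"), ("liars", "NNS")]
feats = target_pos_counts(tagged)  # [1, 1]: one NNS, one PRP
```

The resulting count vector is what gets appended to the BERT representation.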

Cost Weight Adjustments
An analysis of the datasets showed large class imbalances for all sub-tasks. We follow the method described by Tayyar Madabushi et al. (2019) and modify the cost function so that poorly represented classes have more impact when calculating the cost of errors. They show that other techniques, such as data augmentation through oversampling, do not improve the performance of BERT. We use cost weighting for both tasks.
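One common way to derive such weights is by inverse class frequency; this sketch shows one reasonable scheme and is not necessarily the exact formula of Tayyar Madabushi et al. (2019):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rarer classes receive larger
    weights, so errors on them contribute more to the cost."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Imbalanced toy labels: 8 negative, 2 positive examples.
weights = class_weights([0] * 8 + [1] * 2)  # rare class 1 gets weight 2.5
```

The resulting per-class weights can then scale each example's loss term, e.g. via the `weight` argument of a standard cross-entropy loss.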

Experimental Setup
For each of the sub-tasks we participate in, we split the provided training set into a training set and a development set in a 4:1 ratio. Our test set is the evaluation set from SemEval-2019, and we submit the best version of these experiments to SemEval-2020. In each case, we first experiment with BERT, then with additional features added to BERT, and finally with cost weights. All ensemble models were created at the embedding layer by appending additional features to the BERT embeddings and then using a fully connected layer for classification.
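The 4:1 split can be sketched as follows; the shuffle and fixed seed are our additions for illustration, not details specified above:

```python
import random

def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle and split a dataset into training and development
    sets at a 4:1 ratio (dev_fraction = 0.2)."""
    rng = random.Random(seed)
    examples = examples[:]          # avoid mutating the caller's list
    rng.shuffle(examples)
    n_dev = int(len(examples) * dev_fraction)
    return examples[n_dev:], examples[:n_dev]

train, dev = split_train_dev(list(range(100)))  # 80 train, 20 dev
```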

Sub-Task A
For Sub-Task A, we pre-process with stemming and NLTK's tweet tokenizer for the TF-IDF features, considering only the 6,000 words with the highest term frequency to accommodate memory limitations. With BERT, we found that stemming and lemmatisation do not improve results, so our input uses BERT's default tokenizer with a maximum sequence length of 64. We trained on the English dataset provided, which consists of nine million examples; unfortunately, due to memory constraints, we were unable to use the entire dataset, and the final model used just 10% of it. We also applied cost weighting to account for class imbalance. We used a learning rate of 5e-6 and a batch size of 32, and found the best results within 1 to 2 epochs.

Sub-Task B
For Sub-Task B we used the OLID dataset, as we found that the newly provided dataset had a high rate of mislabelled examples. We used NLTK's POS tagger to extract tags for our noun-count features, counting the 'NNS' and 'PRP' tags as they give the most information about target presence. We used a learning rate of 5e-5 and a batch size of 32. Because the dataset is small, we needed to compensate for the smaller number of steps per epoch, and found the best results within 20 epochs.

Results and Analysis
We present our overall rankings on each of the two sub-tasks in Table 2. While our rank on the first task is not very high, we note that our score is within 2 points of the top scoring team; it should be emphasised that we achieve this result using only 10% of the available training data, due to GPU memory limitations. We rank much closer to the top on Sub-Task B, with a rank of 4 amongst a total of 44 submissions. Our analysis of the training and test data using the Wilcoxon signed-rank test, as described by Tayyar Madabushi et al. (2019), shows that the training and development sets are different enough to warrant the use of cost weighting (Section 3.5). To this end, we introduce cost weighting to each of the three models described above; the results of these experiments are presented in Table 4. We observe that adding the optimal cost weights for poorly represented classes significantly improves the performance of all models. The ensemble model, however, still outperforms both SNN and BERT, despite a large increase in BERT's performance after adding cost weights.

As mentioned, we were only able to train our BERT and ensemble models with 10% of the training data. The performance of our models could be further improved given additional GPU resources, as shown in Table 5.

Sub-Task B
For Sub-Task B, as mentioned in Section 1.1, we hypothesise that these sentence representations can benefit from POS information to help identify the presence of a target. To test this hypothesis, we integrate counts of POS tags with BERT. We use the OLID dataset for training and last year's evaluation set as the test set, and the best performing model is used to make predictions for submission to this year's competition. We present the results of these experiments in the accompanying results table.

Conclusion
We show that incorporating corpus-level information does help improve the performance of BERT. We achieve competitive results using just 10% of the available dataset, and would like to test the limits of this approach by training on the full dataset. Our experiments also show that noun counts help boost the performance of BERT, but not as much as cost weighting.