DuluthNLP at SemEval-2021 Task 7: Fine-Tuning RoBERTa Model for Humor Detection and Offense Rating

This paper presents the DuluthNLP submission to Task 7 of SemEval-2021 on Detecting and Rating Humor and Offense. We explain the approach used to train the model together with the fine-tuning process that produced our results. We focus on humor detection, humor rating, and offense rating, three of the four subtasks provided. We show that optimizing the learning rate, batch size, and number of epochs can increase accuracy and F1 score for humor detection.


Introduction
Humor detection poses a challenge even to humans, not least because of the mix of irony, sarcasm, and puns that underlies humor. Understanding the funniness of humor requires a certain grasp of context, culture, and, for some, even country.
If rating the funniness of humor is a challenge, rating its offensiveness is even more so, especially when doing so requires an appreciation of the sensibilities of the humor's target (whether race, religion, or gender) and, in most cases, of context.
It is little wonder that humor detection has been central to NLP tasks in recent years. The last few SemEvals have featured tasks focused exclusively on either detecting the funniness of humor (Hossain et al., 2020; Van Hee et al., 2018; Potash et al., 2017) or detecting offense (Zampieri et al., 2019). What SemEval-2021 Task 7 seeks to do is combine the detection of both humor and offense for a given corpus.
Our approach fine-tunes the pretrained RoBERTa model (Liu et al., 2019b) using the classifier implementation by HuggingFace (Wolf et al., 2019). The intuition is that RoBERTa achieves state-of-the-art performance on tasks requiring contextual information. Our goal is to measure the effect of varying three hyperparameters (batch size, learning rate, and number of epochs) whilst maintaining default values for the others. Our results show that varying these three hyperparameters can increase performance for humor detection, humor rating, and offense rating. The codebase for our participation in this SemEval task is available on GitHub.1

Related Work
Earlier work on humor detection and rating showed modest gains. With the advent of the attention mechanism (Vaswani et al., 2017), though, and the transformer model a few years later (Dai et al., 2019), not only has interest in humor detection soared in the NLP community, but performance on humor and offense detection has increased (Weller and Seppi, 2019).
This has been particularly so in the last few years, with humor detection and offense rating featured in several SemEval tasks. SemEval-2019 Task 6 on offense rating (Zampieri et al., 2019) attracted 800 participants and 115 submissions, and the interest prompted a second edition, SemEval-2020 Task 12, the following year (Zampieri et al., 2020). Around the same period, humor rating attracted similar interest in SemEval tasks (Hossain et al., 2020). To the best of our knowledge, however, SemEval-2021 Task 7 (Meaney et al., 2021) is the first to measure both humor and offense in the same task.
Most of the winning teams for both humor and offense rating (Morishita et al., 2020; Rozental and Biton, 2019; Wiedemann et al., 2020) implemented BERT or its variants, including ALBERT and RoBERTa, an approach that tended to yield the best results. More often than not, teams exploit ensembles of BERT, GPT-2, RoBERTa, and their variants (Morishita et al., 2020), whilst others stick to a single pretrained model. Our approach in this task is to use a single RoBERTa model, fine-tuning a select number of hyperparameters and measuring model performance for each change of hyperparameter set.

System overview
In this section, we review our system's adoption of the pretrained RoBERTa model (Liu et al., 2019a) for the SemEval tasks. We also describe the Bayesian hyperparameter optimization technique used in our sweeps to select optimal values for the learning rate, batch size, and number of epochs.

Model description
Our system adopts the RoBERTa model because of its ability to achieve state-of-the-art performance on most NLP tasks with minimal effort, including, in our case, humor detection. The RoBERTa model, itself a reimplementation of BERT (Devlin et al., 2019), is first pretrained on an unlabeled text corpus and subsequently fine-tuned on downstream tasks with labeled data.
The RoBERTa model is a significant improvement over the BERT model: it uses dynamic masking during training, removes the Next Sentence Prediction (NSP) objective, and trains with larger mini-batches (which, it has been observed, correlates with performance (Liu et al., 2019a)).
RoBERTa also outperforms BERT thanks to the size and diversity of its pretraining data, drawing on 160GB of text from multiple sources compared to BERT's 16GB.
Our system implements the RoBERTa model with a classification layer on top, using the HuggingFace transformer library2.
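The classification layer on top of the encoder can be sketched in plain PyTorch. This is a hypothetical stand-in for the head HuggingFace attaches, shown only to illustrate the architecture; the hidden size of 768 corresponds to RoBERTa-BASE.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of a RoBERTa-style classification head: take the hidden
    state of the first (<s>) token, pass it through a dense + tanh
    layer, then project to the label space (2 labels for Task 1a)."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        x = hidden_states[:, 0, :]        # <s> token (plays the role of [CLS])
        x = self.dropout(x)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)

head = ClassificationHead()
dummy = torch.randn(4, 16, 768)           # (batch, seq_len, hidden)
print(head(dummy).shape)                  # torch.Size([4, 2])
```

In practice the head's input is the encoder output of the pretrained model, and both are fine-tuned jointly.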
We adopt Bayesian optimization to automate the selection of optimal hyperparameter values for training and evaluation on the three subtasks. Details of the Bayesian optimization method are given in Appendix A.

Sweeps
We use Bayesian optimization to run hyperparameter sweeps for our model, but only after manually selecting a sensible set of hyperparameter values for fine-tuning the RoBERTa model (Liu et al., 2019a) on the SemEval tasks: a learning rate of 2.5e-5, a batch size of 4, and 16 epochs. The initial weights follow the standards set by BERT and RoBERTa (Devlin et al., 2019; Liu et al., 2019a). The remaining parameters are based on the open source implementation by HuggingFace3 (Wolf et al., 2020).
Whilst the initial approach of selecting sensible defaults achieved state-of-the-art results, the manual process was painful and the results sometimes unpredictable. Applying Bayesian optimization to sweep over a range of hyperparameter values helped in selecting hyperparameters, including the number of epochs, batch size, and learning rates, across the 24 layers of the RoBERTa-LARGE model. Our system's implementation of Bayesian optimization is based on the open source wandb client by Weights & Biases4 (Biewald, 2020).
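A wandb sweep of this kind can be sketched as follows. The configuration keys follow the wandb sweep format, but the metric name (`eval_acc`), project name, and exact ranges are illustrative assumptions, not our recorded settings.

```python
# Hypothetical wandb sweep configuration for Bayesian optimization
# over learning rate, batch size, and number of epochs.
sweep_config = {
    "method": "bayes",                                   # Bayesian search
    "metric": {"name": "eval_acc", "goal": "maximize"},  # assumed metric name
    "parameters": {
        "learning_rate": {"min": 0.0, "max": 1e-3},      # uniform over floats
        "batch_size": {"values": [4, 8, 16]},            # discrete choices
        "num_train_epochs": {"min": 6, "max": 40},       # uniform over ints
    },
}
# The sweep would then be launched with the wandb client, e.g.:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="semeval-2021-task7")
# wandb.agent(sweep_id, function=train)  # train() logs eval_acc via wandb.log
print(sweep_config["method"])
```

Each agent run samples a configuration, trains, and reports the metric back to the Bayesian optimizer, which proposes the next configuration.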

Experimental Setup
In this section, we present the experimental setup for our model and the hyperparameter sweeps for the SemEval tasks.

Implementation
For all experiments with the RoBERTa model (using the RoBERTa-BASE and RoBERTa-LARGE variants), we use the PyTorch5 implementation in the HuggingFace transformer open source library6 (Wolf et al., 2020) together with the simpletransformers7 (Rajapakse, 2019) wrapper library. We maintain the default weights and hyperparameters, changing only the learning rate and batch size and fine-tuning for different numbers of epochs. All experiments were run on V100 GPUs with 16GB of memory.
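With simpletransformers, the setup reduces to a small argument dictionary plus a model constructor. The sketch below uses our manually chosen defaults; the training calls are commented out because they require downloading the pretrained weights and supplying the task data (`train_df`/`eval_df` are hypothetical DataFrames with `text` and `labels` columns).

```python
# Sketch of the fine-tuning setup via the simpletransformers wrapper.
model_args = {
    "learning_rate": 2.5e-5,      # i.e., 0.000025
    "train_batch_size": 4,
    "num_train_epochs": 16,
    "overwrite_output_dir": True,
}
# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel("roberta", "roberta-large", args=model_args)
# model.train_model(train_df)                 # fine-tune on Task 1a labels
# result, outputs, wrong = model.eval_model(eval_df)
print(model_args)
```

All remaining hyperparameters stay at the wrapper's defaults, which in turn mirror the HuggingFace implementation.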

Data Processing
The training and evaluation data are summarized in Table 1. The annotators for the dataset were a diverse group of individuals of differing age groups (18-70), genders, political views, and income levels, their backgrounds shaping their perceptions of jokes and humor. For each text in the dataset, annotators were asked to label it as humorous or not, and to rate the humor level on a scale of 1 to 5.
The subjectivity level of each text is also captured as a controversy score. Each text is labeled as controversial if the variance of its humor rating is greater than the median variance of all texts. Otherwise, it is labeled as not controversial.
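The controversy rule above can be made concrete with a toy example; the ratings here are made up purely for illustration.

```python
from statistics import median, pvariance

# A text is labeled controversial when the variance of its humor
# ratings exceeds the median variance over all texts.
ratings = {
    "text_a": [1, 5, 1, 5],   # raters strongly disagree
    "text_b": [3, 3, 4, 3],   # raters largely agree
    "text_c": [2, 3, 2, 3],
}
variances = {t: pvariance(r) for t, r in ratings.items()}
median_var = median(variances.values())
controversial = {t: v > median_var for t, v in variances.items()}
print(controversial)   # only text_a is controversial
```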
As a way of combining humor and offense detection in the same task, a first in SemEval tasks, annotators were asked to classify humor as either offensive or not and, if offensive, to rate the offensiveness on a scale of 1 to 5. Non-offensive humor received a zero rating.
Overall, the SemEval task divides into four subtasks. Task 1a, a binary classification task, predicts whether a given text should be considered humorous. Task 1b, a regression task, assigns a rating between 1 and 5 to text considered humorous, and 0 otherwise. Task 1c, also a binary task, assigns a controversy score to a text. The fourth subtask predicts the general offensiveness of a text on a scale of 0 to 5. Our system experiments with only three of the subtasks: Task 1a, Task 1b, and Task 2a.

Hyperparameter tuning
Our approach to hyperparameter tuning involves two steps: one manual (implemented during the evaluation phase), the other using Bayesian optimization (implemented during the post-evaluation phase). In the first step, we experiment with a range of hyperparameter values on Task 1a and apply the results to train our model on the various subtasks.
In the second step, implemented during the post-evaluation phase, we run hyperparameter sweeps on each task separately.
In the first step, we manually select from a range of tunable hyperparameters, with batch sizes ∈ {4, 16}, learning rates ∈ {2e-5, 4e-5, 1e-4}, and numbers of epochs ∈ {6, 9, 12, 16}. The remaining hyperparameters, including the dropout rate and the parameter weights, are set to the default values of the RoBERTa implementation in the HuggingFace transformer library.
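The manual step amounts to a small grid over these three hyperparameters; a minimal sketch follows, where `train_and_eval` is a hypothetical helper that fine-tunes the model and returns dev-set accuracy.

```python
from itertools import product

# The grid from step one: 2 batch sizes x 3 learning rates x 4 epoch counts.
batch_sizes = [4, 16]
learning_rates = [2e-5, 4e-5, 1e-4]
epoch_counts = [6, 9, 12, 16]

grid = list(product(batch_sizes, learning_rates, epoch_counts))
print(len(grid))  # 24 configurations
# best_cfg = max(grid, key=lambda cfg: train_and_eval(*cfg))  # hypothetical
```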
Using the results of the first step as our initial defaults for batch size and learning rate, we run a fine-grained hyperparameter sweep using Bayesian optimization across the 24 pretrained layers of the RoBERTa-LARGE model. We select learning rates between 0 and 1e-3 for the pretrained layers. We fine-tune for 6 to 40 epochs, applying early stopping and using accuracy as the evaluation metric on the validation set for Task 1a, and RMSE as the evaluation metric for Task 1b and Task 2a.
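For reference, the RMSE metric used on the regression subtasks is simply:

```python
from math import sqrt

def rmse(preds, golds):
    """Root-mean-square error between predicted and gold ratings."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds))

print(rmse([1.0, 3.0], [2.0, 3.0]))  # sqrt(0.5), about 0.707
```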
We run hyperparameter sweeps on all three subtasks: Task 1a, 1b, and 2a. We then use the resulting learning rates across the pretrained layers, together with the batch size, to train our model for the selected number of epochs on each subtask. Table 5 shows the results of each hyperparameter sweep.

Results
In this section we present the results of our SemEval tasks and our analysis of each step. We first present our baseline method and results. We then follow up with the results obtained during the evaluation phase, analysing the impact of the manual sweep and fine-tuning of the pretrained RoBERTa model on the subtasks. In the last step, using the CodaLab scores, we analyse the impact of the hyperparameter sweeps on the scores during the post-evaluation phase.

Table 3: Post-evaluation scores. These scores were generated during the post-evaluation stage after multiple hyperparameter sweeps.

Baseline
For our baseline on the regression tasks, we use the simple linear regression class from scikit-learn. For the classification task, we use logistic regression, also from scikit-learn, but with binarized n-gram counts, a method proposed by Wang and Manning (2012). The baseline result for the classification task is 89%. The RMSE baseline results for the regression tasks, Task 1b and Task 2a, are 0.54 and 0.74 respectively.
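The classification baseline can be sketched as follows; the texts and labels here are toy stand-ins for the task data, not examples from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Logistic regression over binarized n-gram counts
# (Wang and Manning, 2012), built as a scikit-learn pipeline.
texts = ["what a hilarious pun", "a dry policy report",
         "that joke was so funny", "quarterly earnings summary"]
labels = [1, 0, 1, 0]   # 1 = humorous, 0 = not humorous

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # presence, not counts
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["a funny joke"]))
```

Setting `binary=True` records n-gram presence rather than frequency, the key detail of the Wang and Manning baseline.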

Official Evaluation
As the evaluation results in Table 2 show, the F1 score of 0.939 for Task 1a is very high, even with manually chosen values for the learning rate, number of epochs, and batch size. This shows that using the recommended learning rate and batch size to fine-tune a pretrained RoBERTa model on humor classification tasks can achieve very high results. On the other hand, the same hyperparameter values achieved only an average RMSE score for Task 1b and an average F1 score for Task 2a, which suggests that our approach of using the same hyperparameter values for all subtasks does not work.

Post Evaluation
During the post-evaluation phase, the team carried out extensive hyperparameter fine-tuning with Bayesian optimization. Table 3 shows a substantial gain in the F1 score for Task 1a, which rises to 0.95.

Error Analysis
To measure our model's predictions against the human annotations, we calculate the confusion matrix, comparing the predicted results with the values in the gold test set for Subtask 1a. Figure 1, the confusion matrix for the evaluation phase, shows comparable numbers of false positives and false negatives. In Figure 2, however, the number of false positives (34) is almost twice the number of false negatives, because the train set has more label-1 examples (4,932) than label-0 examples (3,068), an imbalance that can lead to false positives. Again, the total number of true positives (596), almost twice the number of true negatives (351), shows the model is biased towards positive labels, which will make it difficult to generalize. Moreover, as shown in Table 4, about half of the false positive predictions are also offensive, in part because most of the texts in the train set labeled as humorous are also labeled as offensive.
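This comparison can be reproduced with scikit-learn; the gold and predicted labels below are a toy example, not the task data.

```python
from sklearn.metrics import confusion_matrix

# Compare predicted labels against gold labels for a binary task and
# read off the four confusion-matrix cells.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 0, 1, 0, 1, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(gold, pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 4
```

Note that scikit-learn orders the flattened matrix as (TN, FP, FN, TP), with rows indexed by the gold label and columns by the prediction.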
The high number of accurate predictions in both the evaluation phase (926) and the post-evaluation phase (947) shows our model is effective at detecting humor.
However, the RMSE scores for humor rating (Task 1b), even the improved RMSE score of 0.58 during the post-evaluation phase, still lag behind the RMSE results for Task 2a, the offense-rating subtask. This suggests that our model performs better at offense rating than at humor rating. It might also suggest that the higher scores on Task 1a rest on the model's ability to detect offense, which would explain why most of the false positive texts also contain offensive content.

One major limitation of our approach was that the hyperparameter runs during the evaluation phase were carried out only on Task 1a, a binary classification task, with the results applied to train our model on the two regression tasks, which may explain the subpar results for Task 1b and Task 2a. However, the steps taken during the post-evaluation phase, independently running the sweeps on each subtask, showed a substantial increase in performance.
In addition, the RoBERTa model, as implemented by HuggingFace, is used as is, without modification to either the classification layer on top or any of the pretrained layers. In future work, it will be worth investigating how modifying either of these impacts humor and offense detection.
Overall, however, our system shows that finding optimal values for the learning rate, batch size, and number of epochs can yield higher performance for humor detection.

Ethical Considerations
The training of RoBERTa, along with other language models such as BERT and its variants, has been shown to be costly, both environmentally and financially (Strubell et al., 2019). Moreover, the embeddings used in these language models tend to amplify racial, sexist, and homophobic biases. Mindful of these tendencies, our experiments included steps to minimize bias and reduce energy cost.
What SemEval-2021 Task 7 (Meaney et al., 2021) intends to achieve is not only to rank humor but also to rate its offensiveness, a first for any SemEval task. To achieve this, the dataset contains as much humor as hate, covering racial slurs, gender bias, trans- and homophobic comments, and more. Knowledge of what ranks as offensive in humorous text can help systems like ours moderate humorous content.
To ensure that the datasets used for training and development do not over-represent hegemonic viewpoints, Meaney et al. (2021), the organizers of SemEval-2021 Task 7, employed annotators from disparate backgrounds in age, gender, and political views, so that humor rating and ranking, a subjective process, reflected varied viewpoints.
Annotators were limited to English speakers, however, which implies that the system's ability to detect and identify humor largely reflects views inherent in the English language.
Training and testing were carried out on a single V100 GPU with 16GB of memory, a step taken to ensure minimal, if any, impact on the environment.
The model, however, is prone to classifying offensive content as humorous, which suggests that applications based on our model may be more likely to rate as humorous content that might be deemed offensive.