Humor@IITK at SemEval-2021 Task 7: Large Language Models for Quantifying Humor and Offensiveness

Humor and Offense are highly subjective due to multiple word senses, cultural knowledge, and pragmatic competence. Hence, accurately detecting humorous and offensive texts has several compelling use cases in Recommendation Systems and Personalized Content Moderation. However, due to the lack of an extensive labeled dataset, most prior works in this domain have not explored large neural models for subjective humor understanding. This paper explores whether large neural models and their ensembles can capture the intricacies associated with humor/offense detection and rating. Our experiments on SemEval-2021 Task 7: HaHackathon show that we can develop reasonable humor and offense detection systems with such models. Our models are ranked 3rd in subtask 1b and consistently ranked around the top 33% of the leaderboard for the remaining subtasks.


Introduction
Like most figurative language, humor and offense pose interesting linguistic challenges to Natural Language Processing due to their emphasis on multiple word senses, cultural knowledge, sarcasm, and pragmatic competence. A joke's perception is highly subjective, and age, gender, and socioeconomic status extensively influence it. Prior humor detection/rating challenges treated humor as an objective concept. SemEval 2021 Task 7 (Meaney et al., 2021) is the first humor detection challenge that incorporates the subjectivity associated with humor and offense across different demographic groups. Annotators from varied age groups and genders labeled each text as humorous or not and provided an associated score. It is also quite common for a text to be humorous to one reader and neutral or offensive to another; only rarely is the same content universally accepted as witty. To the best of our knowledge, Meaney et al. (2021) is the first initiative towards annotating the underlying humor as controversial or not. Understanding whether a text is humorous and/or offensive will aid downstream tasks, such as personalized content moderation, recommendation systems, and flagging offensive content.
* Authors contributed equally to the work. Names are in alphabetical order.
Large Language Models (LLMs) have recently emerged as the SOTA for various Natural Language Understanding tasks (Raffel et al., 2019; Conneau et al., 2019; Zhang et al., 2020). However, typical day-to-day texts, where these models have shown state-of-the-art performance, are less ambiguous than texts containing puns/jokes. Training and evaluating LLMs in the context of highly ambiguous/subjective English texts would serve as an excellent benchmark to figure out the current shortcomings of these models. This paper studies various large language models -- BERT (Devlin et al., 2018), RoBERTa, XLNet, ERNIE-2.0 (Sun et al., 2019), and DeBERTa (He et al., 2020) -- and their ensembles for humor and offense detection tasks. Additionally, we explore a Multi-Task Learning framework to train on all four subtasks jointly and observe that joint training improves the performance on regression tasks.
We have achieved significant performance on all the subtasks and have consistently ranked within the top third of the total submissions. We were ranked (1) 21st with an F-score and accuracy of 94.8% and 95.81% respectively in Task 1a, (2) 3rd with an RMSE score of 0.521 in Task 1b, (3) 9th with an F-score and accuracy of 45.2% and 62.09% respectively in Task 1c, and (4) 16th with an RMSE score of 0.4607 in Task 2. We release the code for models and experiments via GitHub. We organize the rest of the paper as follows: we begin with a description of the challenge tasks, followed by a brief literature survey in section 2. We then describe all of our proposed models in section 3, with training details in section 4, and present the experimental results in section 5. Finally, we analyze our findings and conclude in sections 6 and 7, respectively.

Problem Description
SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense (Meaney et al., 2021) involves two main tasks -- humor detection and offense detection. The organizers further subdivide the task into the following subtasks:
1. Humor detection tasks:
(a) Task 1a involves predicting whether a given text is humorous.
(b) Task 1b requires predicting the humor rating of a given humorous text.
(c) Task 1c incorporates humor subjectivity by posing a classification problem of predicting whether the underlying humor is controversial or not.
2. Task 2 is an offense detection task and is posed as a bounded regression problem. Given a text, we need to predict a mean score denoting the text's offensiveness on a scale of 0 to 5, with 5 being the most offensive.
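The four labels above can be summarized with a small sketch; the field names here are illustrative stand-ins, not the official column names of the challenge data:

```python
# Hypothetical example record illustrating the four subtask labels.
example = {
    "text": "Why did the chicken cross the road? ...",
    "is_humor": 1,           # Task 1a: binary humor label
    "humor_rating": 2.4,     # Task 1b: only defined when is_humor == 1
    "humor_controversy": 0,  # Task 1c: only defined when is_humor == 1
    "offense_rating": 0.3,   # Task 2: mean offensiveness on a 0-5 scale
}

def valid(record):
    """Sanity-check a record against the task definitions above."""
    ok = record["is_humor"] in (0, 1)
    ok &= 0.0 <= record["offense_rating"] <= 5.0
    if record["is_humor"] == 1:
        ok &= record["humor_controversy"] in (0, 1)
    return bool(ok)
```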

Related Works
Transfer Learning: ULMFiT (Howard and Ruder, 2018) used a novel neural-network-based method for transfer learning and achieved SOTA results on a small dataset. Devlin et al. (2018) introduced BERT to learn latent representations in an unsupervised manner, which can then be finetuned on downstream tasks to achieve SOTA results.
Humor & Emotion Detection: Weller and Seppi (2019) first proposed the use of transformers (Vaswani et al., 2017) in humor detection and outperformed the state-of-the-art models on multiple datasets. Ismailov (2019) likewise finetuned pretrained transformer models for humor classification.


System Overview

Data
The challenge dataset comprises a train set (8000 labeled texts) and a public-dev set (1000 labeled texts). Each text is labeled 1/0 according to whether it is humorous or not and rated with an offensiveness score on a scale of 0-5. If a text is labeled humorous, it is further annotated with a humor rating and classified as controversial or not. For our single-task models (Section 3.2), we train on the train + public-dev set after obtaining a suitable stopping epoch by training and validating on the train and public-dev sets respectively. For our multi-task models (Section 3.3), we train on 8200 texts sampled randomly from the train and public-dev sets and use the remaining 800 texts for validation.
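The multi-task split described above can be sketched as follows; the seeded shuffle is an assumption for reproducibility, not a detail stated in the paper:

```python
import random

def split_pool(pool, n_train=8200, seed=0):
    """Pool the 9000 labeled texts (8000 train + 1000 public-dev),
    sample n_train for training, and hold out the rest for validation."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

train_ids, val_ids = split_pool(range(9000))
```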

Single Task Model
As the tasks are evaluated independently, we have explored LLMs for each task/subtask independently and will refer to them as single-task models. Inspired by Demszky et al. (2020), for each task we add a classification (for Tasks 1a, 1c) or a regression (for Tasks 1b, 2) head on top of pretrained models like BERT, RoBERTa, ERNIE-2.0, DeBERTa, and XLNet, and train the model end-to-end (Figure 1a). This ensures that the model learns features solely related to the task, enhancing performance. Also, as we only add a classification/regression head, the number of learnable parameters does not increase much. This helps us finetune the model on such a small dataset for a small number of epochs, avoiding overfitting and resulting in better generalization.
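A minimal PyTorch sketch of this head placement; `encoder` is a stand-in for a pretrained LLM that returns token-level embeddings, and only the [CLS] position feeds the task head:

```python
import torch
import torch.nn as nn

class SingleTaskModel(nn.Module):
    """One pretrained encoder plus one task-specific head, trained end-to-end.
    num_labels=2 gives a classification head (Tasks 1a, 1c);
    num_labels=1 gives a regression head (Tasks 1b, 2)."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        token_states = self.encoder(input_ids)   # (batch, seq, hidden)
        cls_state = token_states[:, 0, :]        # embedding of [CLS]
        return self.head(cls_state)
```

In practice the encoder would be loaded from a pretrained checkpoint (e.g. via HuggingFace); here a toy embedding layer can stand in for testing shapes.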

Multi Task Learning
Collobert and Weston (2008) demonstrated that Multi-Task Learning (MTL) improves generalization performance across tasks in NLP. The different tasks, though uncorrelated, share the same underlying data distribution. This can be of great help for Tasks 1b and 1c, where labeled instances are far fewer than for Task 1a or 2. Exploiting the fact that all tasks share the same data distribution, we propose to learn a model jointly on all the tasks. Specifically, we consider hard parameter sharing among the different tasks and parameterize the base models using a neural network, followed by two heads for the classification and regression tasks (Figure 1b). Our base models include LLMs like BERT, RoBERTa, and ERNIE. Contrary to the LSTM layer, which helps in learning features using all the token-level embeddings, the Fully Connected (FC) layer focuses only on the embedding of the [CLS] token. Hence, having these two branches allows the model to focus on different tasks using the same sentence embedding and helps in learning enhanced embeddings for Tasks 1b and 1c with a much smaller labeled dataset.
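The hard-parameter-sharing setup can be sketched as below. The assignment of the FC branch to classification and the BiLSTM branch to rating regression is an illustrative reading of the description above, not the exact released configuration:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with two branches: an FC head over the [CLS]
    embedding and a BiLSTM head over all token embeddings.
    hidden_size must be even (BiLSTM uses hidden_size // 2 per direction)."""

    def __init__(self, encoder, hidden_size, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.cls_head = nn.Linear(hidden_size, num_classes)
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.reg_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        tokens = self.encoder(input_ids)            # (B, T, H), shared
        logits = self.cls_head(tokens[:, 0, :])     # [CLS] -> class logits
        lstm_out, _ = self.lstm(tokens)             # all token embeddings
        rating = self.reg_head(lstm_out[:, -1, :])  # last step -> score
        return logits, rating
```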

Ensembles
LLMs differ mostly in their training procedures and architectures. These large language model frameworks are trained on a wide range of datasets for a variety of tasks. Though they all have comparable performance, they may still capture different aspects of the input. We try to leverage such varied embedding-based predictions by combining multiple models trained with different basenets, using the following strategies. Jointly Trained Model Embeddings: All the large language frameworks have shown huge performance improvements on multiple tasks owing to their highly informative latent input embeddings. We propose to learn an ensemble that leverages the diverse aspects of the input captured by varied LLMs by concatenating their latent embeddings and mapping them to a low-dimensional space for task prediction. We use this method in learning ensembles of the single-task models explained in 3.2.
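A sketch of the jointly trained ensemble, assuming each basenet exposes token-level embeddings; the [CLS] embeddings are concatenated and projected to a low-dimensional space before the task head:

```python
import torch
import torch.nn as nn

class EmbeddingEnsemble(nn.Module):
    """Concatenate [CLS] embeddings from several basenets (stand-ins for
    pretrained LLMs such as BERT, RoBERTa) and map them down for prediction."""

    def __init__(self, encoders, hidden_sizes, proj_dim, num_labels):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.project = nn.Linear(sum(hidden_sizes), proj_dim)
        self.head = nn.Linear(proj_dim, num_labels)

    def forward(self, input_ids):
        cls_embs = [enc(input_ids)[:, 0, :] for enc in self.encoders]
        joint = torch.cat(cls_embs, dim=-1)          # concatenated embeddings
        return self.head(torch.relu(self.project(joint)))
```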

Aggregation of Trained Model Predictions:
Joint training, though more informative and powerful, is a computationally intensive approach. Thus, as an alternative, we use a weighted averaging of multiple pretrained models without compromising much on performance. 1. Weighted Averaging: Given k models trained using different LLMs as basenets, the aggregate output ŷ is computed as ŷ = Σ_{i=1}^{k} λ_i · ŷ_i, where ŷ_i and λ_i represent the output and weight of the i-th model respectively. The weights λ_i are obtained through extensive grid search on the held-out validation dataset, or set to 1/k when training on the entire dataset without a validation set. The complete approach is shown in Figure 2. 2. Voting-Based Classification: This is one of the most popular approaches to learning an ensemble and does not involve any hyperparameters or retraining of any of the constituent models. It involves training multiple models independently and using the majority among all the predictions as the final output. For a binary classification task, the final output ŷ is obtained by max-voting across the independent models.
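The two aggregation strategies above reduce to a few lines; the majority-vote reading of "max-voting" is an assumption consistent with the binary setting:

```python
def weighted_average(preds, weights=None):
    """Aggregate k model outputs as sum(lambda_i * y_i); weights default
    to the uniform 1/k used when no validation set is held out."""
    k = len(preds)
    weights = weights if weights is not None else [1.0 / k] * k
    return sum(w * p for w, p in zip(weights, preds))

def max_vote(binary_preds):
    """Majority vote across independent binary classifiers."""
    return int(2 * sum(binary_preds) > len(binary_preds))
```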

Experimental Setup
We used the PyTorch (Paszke et al., 2019) and HuggingFace (Wolf et al., 2020) libraries for our models, and Google Colab GPUs for training and inference. We use the AdamW (Loshchilov and Hutter, 2019) and Adam (Kingma and Ba, 2017) optimizers with an initial learning rate of 2e-5 for training the single-task and multi-task models, respectively. For each of the models we follow a dedicated training pipeline described in the subsequent sections.
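The optimizer setup above, sketched with a small stand-in module in place of an LLM plus head:

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a pretrained LLM + task head

# AdamW for single-task models, Adam for multi-task models, both at 2e-5.
single_task_opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
multi_task_opt = torch.optim.Adam(model.parameters(), lr=2e-5)
```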

Data preprocessing
We split the dataset into training and validation data as described in Section 3.1. The sentences are prepended with a [CLS] token and given as input to the model. We performed additional experiments by removing stopwords but noticed a slight deterioration in performance.

Loss Functions
Tasks 1a & 1c are instances of binary classification problems and have thus been trained using cross-entropy loss. For predicting humor and offense ratings, i.e., Tasks 1b and 2, we have used mean squared error as the loss function.

Training Details
All the models are trained for n epochs, where n is a hyperparameter tuned on the validation set using an early-stopping criterion. For single-task models, we split the train data into training and validation sets to learn the optimal value of n, and then train the model from scratch on the train + public-dev set for n epochs. In the case of multi-task models, the tasks do not all converge at the same rate. Thus, we train multi-task models on 8200 texts sampled randomly from the train + public-dev dataset and validate on the remaining 800 texts. We apply the early-stopping criterion on the validation dataset independently for each task.
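The tuning of n can be sketched as below; the patience value is an illustrative assumption, not a hyperparameter reported in the paper:

```python
def find_stopping_epoch(val_losses, patience=2):
    """Return the epoch n with the best validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs. The model
    is then retrained from scratch on train + public-dev for n epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch
```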

Results
We have trained multiple single-task and multi-task models using basenet LLMs like BERT, DistilBERT, RoBERTa, XLNet, ALBERT (Lan et al., 2019), Electra (Clark et al., 2020), DeBERTa, and ERNIE-2.0. We also learned ensembles of single-task models by either training a classification/regression head on concatenated input embeddings or using a weighted aggregate of the models' predictions. Apart from this, we also explored a voting-based ensemble of multi-task models. All our models perform comparably on all tasks, and the major models are reported in Table 1. We also compare our best model's performance with the top 3 submissions on the leaderboard and report it in Table 2.

Data Augmentation
One recurring issue across all our trained models is a high susceptibility to overfitting. Data Augmentation is a widely accepted solution to reduce overfitting by generating slight variants of the given dataset and is extremely useful for a smaller dataset. One such approach is Masked Language Modelling (MLM), used to perform context-specific data augmentation (Ma, 2019), and it has been used in training LLMs. However, applying this data augmentation during training has consistently degraded the performance of our models. We hypothesize that this is due to the mismatch between the contextual meaning and the associated humor/offense. MLM-based augmentation strategies, with models pretrained to preserve the sentence's meaning, fail to capture the associated humor/offense.
Often the choice of words in a sentence is responsible for its humor/offense rating. Replacing such words with their synonyms can change the humor/offense rating substantially. Hence, using such a data augmentation approach during training injects heavy noise into the ground truth, resulting in deteriorated performance.

Correlation across Tasks
Contrary to our expectation, we failed to ascertain any direct relationship between the humor controversy and the offense rating prediction task. We computed the mean offense rating for the texts labeled as controversial and for the texts marked as non-controversial. The computed mean values are too close to each other to demonstrate any direct correlation conclusively.
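The comparison above amounts to a group-wise mean; the sample records here are illustrative, not drawn from the challenge data:

```python
def mean_offense_by_controversy(records):
    """Mean offense rating per controversy label.
    records: iterable of (controversy_label, offense_rating) pairs."""
    groups = {0: [], 1: []}
    for controversy, offense in records:
        groups[controversy].append(offense)
    return {c: sum(v) / len(v) for c, v in groups.items() if v}

sample = [(1, 1.2), (1, 0.9), (0, 1.0), (0, 1.1)]
means = mean_offense_by_controversy(sample)  # means too close => no signal
```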

Dataset Size
In the literature, finetuning LLMs on small task-specific datasets has shown remarkable task performance. However, our dedicated single-task models could not perform better than our multi-task model on Task 1b. We attribute this to the relatively small amount of supervised data available for Task 1b in comparison to the other tasks. In our multi-task models, though we have fewer labeled texts for Task 1b, the sentence embeddings are still updated using the complete available dataset. Thus, owing to joint learning and shared parameters for Tasks 1b and 2, our multi-task model learns the underlying distribution better than the single-task model. We believe this is the main reason for the enhanced performance of our model on Task 1b, which has less supervised data available in comparison to Task 1a or 2.

Conclusion
We have presented several experiments using large language models like BERT, XLNet, etc., and their ensembles for humor and offense detection and rating. We also discussed some of the underlying challenges arising from the subjective nature of the humor and offense detection tasks. Using these, we explained why standard training practices used to prevent overfitting, like data augmentation, do not work in this context. Our experiments suggest that even though these models can reasonably capture humor and offense, they are still far from understanding every intricacy arising out of subjectivity. To tackle some of the problems highlighted in this paper, a compelling direction would be online data augmentation that alternates between training the embeddings and generating new texts that preserve the humor/offensiveness. Additionally, pretraining these models on datasets annotated by diverse annotators to capture more comprehensive world knowledge should further help in generalization.