YoungSheldon at SemEval-2021 Task 7: Fine-tuning Is All You Need

In this paper, we describe our system for SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. We used a simple fine-tuning approach with different Pre-trained Language Models (PLMs) to evaluate their performance on humor and offense detection. For regression tasks, we averaged the scores of different models, which led to better performance than the individual models. We participated in all SubTasks. Our best performing system was ranked 4th in SubTask 1-b, 8th in SubTask 1-c, and 12th in SubTask 2, and performed well in SubTask 1-a. We further report comprehensive results for different pre-trained language models, which can serve as baselines for future work.


Introduction
Humor is an intelligent form of communication with the capability of providing amusement and provoking laughter (Chen and Soo, 2018). It helps in bridging the gap between various languages, cultures, and demographics. Humor is a very subjective phenomenon. It can have different intensities, and people may find some jokes funnier than others. In certain situations, some jokes may be offensive to a certain group of people. All these characteristics of humor pose an interesting linguistic challenge to NLP systems. SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense (Meaney et al., 2021) aims to draw attention to these challenges in humor detection. The task provides a dataset of humorous content annotated by people representing different age groups, genders, political stances, and income levels. The content of the provided dataset is in English.
Participating in all SubTasks, we propose a fine-tuning-based approach on pre-trained language models. Pre-trained Language Models learn syntactic and semantic representations by training on large amounts of unlabeled data. Recently there has been a lot of interest in PLMs. Researchers have come up with different pre-training methods using Auto-Encoding (AE) and Auto-Regressive (AR) language modeling techniques. These pre-trained models often contain millions of parameters and are computationally expensive. Fine-tuning different models may lead to different results on downstream tasks, which makes the choice of PLM an important factor. We present a comparative study of different PLMs and their performance in all SubTasks of SemEval 2021 Task 7: HaHackathon.
Our proposed fine-tuning approach for each PLM made use of a single layer of one neuron stacked on the PLM features. We performed experiments using BERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), MPNet (Song et al., 2020), and ALBERT (Lan et al., 2020). For regression tasks, we also used an averaging technique, which we describe later in the paper. Our model performed well in SubTask 1-b and 1-c, achieving ranks of 4 and 8 respectively on the official leaderboard. For SubTask 2, our proposed averaging technique outperformed individual fine-tuned models by a good margin and was ranked 12. Our code is available online for method replicability.

Background
The task of automatic humor recognition refers to deciding whether a given sentence expresses humor. The problem is often formulated as a binary classification problem aiming to identify whether the given text is humorous. Weller and Seppi (2019) performed a study to identify whether a joke is humorous using transformers (Vaswani et al., 2017). They used the body of the joke, the punchline exclusively, and both parts together. Combining both parts led to better performance, and it was found that a punchline carries more weight than the body of a joke for humor identification. de Oliveira and Rodrigo (2015) experimented with SVMs, RNNs, and CNNs for identifying humor in Yelp reviews using bag-of-words and mean word vector representations. Chen and Soo (2018) used CNN-based models for identifying humorous content. Annamoradnejad (2021) uses a neural network built on BERT embeddings that learns features for sentences and for the whole text separately and then combines them for prediction on the 200k Short Texts for Humor Detection dataset on Kaggle.
Binary classification tasks help us to separate humorous content but are unable to quantify the degree of humor. SemEval-2017 Task 6: #HashtagWars (Potash et al., 2017) aimed to study the relative humor content of funny tweets by either generating the correct pairwise comparisons of tweets (SubTask A) or finding the correct ranking of the tweets (SubTask B) based on their degree of humor content. SemEval-2020 Task 7: Assessing Humor in Edited News Headlines (Hossain et al., 2020) presented a study on editing news headlines to make them humorous. The task involved quantifying the humor of the edited headline on a scale of (0-3) as well as comparing the humor content of the original and edited headlines. SemEval-2020 Task 8: Memotion Analysis-The Visuo-Lingual Metaphor! (Sharma et al., 2020) provides details on humor classification as well as predicting its semantic scale on internet memes using both images and text. OffensEval (Zampieri et al., 2019, 2020) provides insights for identifying offensive content on social media.
Humor is an intelligent way of communicating in our daily lives. It helps bridge the gap between people of various cultures, ages, genders, languages, and socioeconomic statuses, making it a powerful tool to connect with an audience. Humor is a highly subjective phenomenon. People from different demographics may perceive humor differently, and some may even find it offensive. This makes identifying humor a tough task. SemEval 2021 Task 7: HaHackathon: Linking humor and offense across different age groups aims to study this subjective nature of humor. It has two SubTasks, which we describe as follows.
SubTask 1: Given a labeled dataset D of texts, the task aims to learn a function that can:
• SubTask 1-a: predict if a text is humorous or not.
• SubTask 1-b: quantify humor present in a humorous text within a range of (0-5).
• SubTask 1-c: predict if the humor rating would be controversial for a humorous text, i.e., the variance of the rating between annotators is higher than the median.
SubTask 2: Given a labeled dataset D of texts, the task aims to learn a regression function that can quantify how offensive a text is for general users within a range of (0-5).
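The controversy criterion of SubTask 1-c can be illustrated with a small sketch. It assumes that "the median" refers to the median of the per-text rating variances over the dataset and that population variance is used; neither detail is stated in this section, and the ratings below are made up for illustration.

```python
from statistics import median, pvariance


def controversy_labels(ratings_per_text):
    """Label a text 1 (controversial) if its rating variance exceeds the median variance."""
    # Assumption: population variance of the annotator ratings for each text.
    variances = [pvariance(ratings) for ratings in ratings_per_text]
    # Assumption: the threshold is the median variance over all texts.
    threshold = median(variances)
    return [int(v > threshold) for v in variances]


# Hypothetical annotator ratings for three humorous texts (scale 0-5).
ratings = [[2, 2, 2, 2], [0, 4, 1, 5], [1, 2, 1, 2]]
labels = controversy_labels(ratings)
print(labels)  # [0, 1, 0]
```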
Dataset Statistics: Table 1 presents the dataset statistics for the classification tasks. For SubTask 1-a, there is a slight class imbalance between humorous and non-humorous labels. We addressed this imbalance using class weights. Let X be the vector containing the count of each class, with X_i denoting the count for class i; the weight for each class was then computed from these counts. For SubTask 1-c, the label distribution was balanced. Table 2 presents the statistics for the regression tasks. Another observation on the training set for SubTask 2 was that 3388 samples had an offense rating of 0 and nearly 80% of samples had an offense rating in the range 0-1.
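The exact class-weight formula is not reproduced in this section. As an illustration only, the sketch below uses a common inverse-frequency heuristic, w_i = N / (K · X_i), where N is the total number of samples and K the number of classes. This is an assumption and not necessarily the weighting we used; the class counts are hypothetical.

```python
def inverse_frequency_weights(counts):
    """Return one weight per class so that rarer classes weigh more.

    Assumed heuristic: w_i = N / (K * X_i), with N the total sample count,
    K the number of classes, and X_i the count of class i.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical counts: 0 = non-humorous, 1 = humorous.
counts = {0: 3000, 1: 5000}
weights = inverse_frequency_weights(counts)
print(weights)
```

A dictionary of this form can typically be passed as per-class weights to a training loop so that errors on the minority class contribute more to the loss.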

Pre-trained Language Models
NLP is a diverse field with many tasks, but most task-specific datasets contain only a few hundred or a few thousand human-labeled samples. To overcome this problem, researchers have come up with a method called pre-training (Qiu et al., 2020), which involves training general-purpose language representations using enormous amounts of unannotated textual data. These language models can then be fine-tuned on various downstream tasks and have shown promising results in many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018; Radford and Narasimhan, 2018). Next, we briefly discuss the pre-trained language models we used for the task.

Brief overview of used Pre-trained Models
BERT: Bidirectional Encoder Representations from Transformers is a bidirectional language model that uses the Transformer (Vaswani et al., 2017) architecture to learn contextual relations between different words in a text sequence (Devlin et al., 2019). It makes use of two pre-training objectives: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).
ELECTRA: It introduces a new pre-training objective called Replaced Token Detection (RTD) (Clark et al., 2020). Unlike BERT, which replaces some input tokens with [MASK], ELECTRA replaces certain tokens with plausible fakes. The pre-training task then requires the model to determine, for each input token, whether it is the original or has been replaced. This binary classification task is applied to all tokens rather than only the small number of masked tokens, making RTD more efficient than MLM.
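A minimal sketch of how RTD targets can be derived is shown below. It only illustrates the labeling step, assuming the original and corrupted sequences are already aligned token by token; the function name and the example sentence are for illustration and are not taken from any library.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 for each position whose token was replaced, else 0."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must be aligned token by token")
    # A position counts as replaced only if the token actually differs,
    # so a fake that happens to equal the original is labeled 0.
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position receives a label, which is why the objective provides a training signal on all tokens rather than only on the masked subset.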
RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019) was developed by Facebook. It uses the BERT architecture with modifications to improve performance on downstream tasks. It uses dynamic masking in the pre-training objective and removes the NSP objective. The model was also trained for a longer duration with more data and a larger batch size. RoBERTa outperformed BERT on several downstream tasks.
XLNet: XLNet (Yang et al., 2019) is a generalized autoregressive pre-training method that takes the best of both AR language modeling and AE modeling techniques. It proposes a permutation language modeling objective for pre-training that helps learn bidirectional contexts. Owing to its autoregressive formulation, it also overcomes the pretrain-finetune discrepancy (Yang et al., 2019) present in BERT.
MPNet: MPNet (Song et al., 2020) was proposed by Microsoft. It overcomes the positional discrepancy between pre-training and fine-tuning in XLNet, which does not use the full position information of a sentence. It proposes a unified view of masked language modeling and permuted language modeling by rearranging and splitting the tokens into predicted and non-predicted parts. It combines masked and permuted language modeling so that the model captures the dependency among predicted tokens while seeing the position information of the full sentence.
ALBERT: A Lite BERT for self-supervised learning of language representations (Lan et al., 2020) is a modification of BERT that aims to allocate model capacity efficiently in order to reduce training time and memory consumption. ALBERT decomposes the embedding matrix into two smaller matrices: tokens are first embedded in a lower-dimensional space, which is then projected to the hidden space. This is called factorized embedding parameterization and reduces the number of parameters. It also shares parameters across all layers, which helps remove redundancy. Additionally, it uses an inter-sentence coherence loss based on Sentence Order Prediction (SOP) (Lan et al., 2020).
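The saving from factorized embedding parameterization can be checked with simple arithmetic: a direct embedding needs V × H parameters, while the factorized form needs V × E + E × H. The sizes below (V = 30000, H = 768, E = 128) are illustrative assumptions chosen to resemble a base-sized model, not figures taken from our experiments.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Direct V x H embedding matrix, as in BERT.
        return vocab_size * hidden_size
    # Factorized: V x E lookup followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


full = embedding_params(30000, 768)
factorized = embedding_params(30000, 768, embedding_size=128)
print(full, factorized)  # 23040000 3938304
```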

Fine-tuning
We fine-tuned the pre-trained language models for each SubTask by stacking a dropout layer followed by a single-neuron dense layer on top of the PLM features. We used the features of the [CLS] token in the case of ALBERT, BERT, XLNet, and ELECTRA, and the features of the start token (<s>) in the case of RoBERTa and MPNet. For the classification tasks, sigmoid activation was used in the final layer. For the regression tasks, we did not use any activation, and negative predictions were set to zero.
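The arithmetic of this head at inference time can be sketched in a few lines. This is a simplified illustration rather than our training code: the feature vector stands in for the PLM output at the classification token, the weights and bias are hypothetical, and dropout is omitted because it acts as the identity at inference.

```python
import math


def head_output(features, weights, bias, task):
    """Apply a single-neuron dense layer to a pooled feature vector."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    if task == "classification":
        # Sigmoid turns the logit into a probability.
        return 1.0 / (1.0 + math.exp(-z))
    if task == "regression":
        # No activation; negative predictions are set to zero.
        return max(z, 0.0)
    raise ValueError(f"unknown task: {task}")


features = [0.5, -1.0, 2.0]   # hypothetical classification-token features
weights = [0.2, 0.4, -0.3]    # hypothetical learned weights
prob = head_output(features, weights, 0.1, "classification")
score = head_output(features, weights, 0.1, "regression")
print(prob, score)
```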

Averaging for Regression tasks
For regression tasks, we combined all fine-tuned models by averaging their predictions. For SubTask 1-b, we averaged the predictions of all models. For SubTask 2, as stated earlier, there were many zero values in the training set. Therefore, we averaged the predictions only when all models predicted a nonzero value. If any of the models predicted zero for a given sample, we took zero as the final prediction.
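The two combination rules above can be sketched as follows. The function and argument names are illustrative, the predictions are made up, and clipping negatives to zero before combining is an assumption based on the post-processing described for the regression head.

```python
def combine_predictions(model_preds, zero_if_any_zero=False):
    """Average per-sample predictions from several models.

    model_preds: one list of predictions per model, all of equal length.
    zero_if_any_zero: if True, output 0 for a sample whenever any model
    predicts 0 (the SubTask 2 rule); otherwise use a plain mean.
    """
    combined = []
    for sample_preds in zip(*model_preds):
        # Assumption: negative predictions are clipped to zero first.
        clipped = [max(p, 0.0) for p in sample_preds]
        if zero_if_any_zero and any(p == 0.0 for p in clipped):
            combined.append(0.0)
        else:
            combined.append(sum(clipped) / len(clipped))
    return combined


# Hypothetical predictions from three models on three samples.
preds = [[0.5, 1.2, -0.1], [0.7, 0.8, 0.4], [0.0, 1.0, 0.3]]
task_1b = combine_predictions(preds)
task_2 = combine_predictions(preds, zero_if_any_zero=True)
print(task_1b, task_2)
```

In this example the first and third samples become exactly zero under the SubTask 2 rule because at least one model predicted zero for them, while the plain mean would have produced small positive values.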

Experimental Setup
We used the ekphrasis (Baziotis et al., 2017) library for pre-processing the text inputs. It normalized dates, times, and numbers to a standard format and also performed spelling correction. For tokenization, we used Hugging Face's (Wolf et al., 2020) implementation of fast tokenizers for each pre-trained model. We fixed the sequence length of samples to 150 tokens. Models were developed in Keras (Chollet et al., 2015; https://keras.io) using the transformers library (Wolf et al., 2020; https://huggingface.co/transformers) by Hugging Face. We used the Adam (Kingma and Ba, 2017) optimizer for fine-tuning. A learning rate of 1e-4 was used for ELECTRA. For other models, we experimented with 1e-5, 2e-5, and 3e-5. We used binary cross-entropy loss for classification tasks and logCosh loss for regression tasks. A batch size of 16 was used for all models. Fine-tuning was performed on TPUs on Google Colab. We fine-tuned for 4 epochs on SubTask 1 and 8 epochs on SubTask 2. F1 score and RMSE were used as evaluation metrics for classification and regression tasks, respectively. Weights with the best performance on the development set were used for making predictions on the test set.
Table 3 shows the results of our proposed fine-tuning approach for different pre-trained models. Our simple averaging technique worked quite well for regression tasks. Our model was ranked 4 in SubTask 1-b and ranked 12 in SubTask 2. The averaging method we proposed for SubTask 2 provided a significant improvement in RMSE over the individually fine-tuned models. Upon examination of the test set, we found that 40.8% of samples were given a zero offense rating. Thus, our decision to predict zero if any of the models predicted zero helped in improving scores over individual models. For classification tasks, our model was ranked 8 in SubTask 1-c and performed well for SubTask 1-a. Figure 1 and Figure 2 show plots of confusion matrices for our best performing model fine-tuned on BERT.
For SubTask 1-a, our model was effective in separating humorous and non-humorous content, as false positives are low for each class. For SubTask 1-c, our model performed well in identifying controversial text but did not perform very well for non-controversial text. The model has very high recall but low precision for the controversial class due to a high number of false positives, which is evident from the confusion matrix. BERT, MPNet, and XLNet performed better than other PLMs for SubTask 1-a. For SubTask 1-b, individual models had similar performance, and averaging helped in improving the performance. BERT, ELECTRA, and ALBERT had the best performance on the test set for SubTask 1-c.
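For reference, the regression loss and the evaluation metrics named in the setup can be written out in plain Python. This is a sketch of the standard definitions and not the code used in our experiments; in particular, the F1 shown is the binary F1 of the positive class, and the averaging used by the official scorer is not specified in this section.

```python
import math


def log_cosh_loss(y_true, y_pred):
    """Mean of log(cosh(prediction - target)) over all samples."""
    return sum(math.log(math.cosh(p - t)) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def binary_f1(y_true, y_pred):
    """F1 score of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up targets and predictions for illustration.
print(log_cosh_loss([0.0, 1.0, 3.0], [0.0, 2.0, 1.0]))
print(rmse([0.0, 1.0, 3.0], [0.0, 2.0, 1.0]))
print(binary_f1([1, 0, 1, 1], [1, 1, 0, 1]))
```

The logCosh loss behaves like squared error for small residuals and like absolute error for large ones, which makes it less sensitive to occasional large errors than a pure squared loss.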

Conclusion
This paper describes the system we used for competing in all SubTasks of SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. We used a simple fine-tuning approach for analyzing the performance of various pre-trained language models on the task of humor detection. We performed well in all SubTasks except SubTask 1-a. A lot of research is happening around pre-trained language models, with new and better models coming up. These models are large and computationally expensive. Choosing a model becomes a difficult task, as different models may yield different results on downstream tasks. We therefore performed experiments with recent state-of-the-art models and provided a comparative analysis of their performance. In the future, we would like to study the effect of further pre-training PLMs on additional task-specific data before fine-tuning and to examine their performance on downstream tasks.