MLEngineer at SemEval-2020 Task 7: BERT-Flair Based Humor Detection Model (BFHumor)

Task 7 of the International Workshop on Semantic Evaluation (SemEval-2020), Assessing the Funniness of Edited News Headlines, introduces two sub-tasks for predicting the funniness of edited news headlines collected from the Reddit website. This paper presents the BFHumor model of the MLEngineer team, which participated in both sub-tasks of this competition. BFHumor is a BERT-Flair based humor detection model that combines different pre-trained models with various Natural Language Processing (NLP) techniques. The Bidirectional Encoder Representations from Transformers (BERT) regressor is the primary pre-trained model in our approach, and Flair is the main NLP library. The BFHumor model ranked 4th in sub-task1 with a root mean squared error (RMSE) of 0.51966, only 0.02 away from the first-ranked model, and 12th in sub-task2 with an accuracy of 0.62291, 0.05 away from the top-ranked model. These results indicate that BFHumor is among the top models for detecting humor in text.


Introduction
Humor is a distinctive human quality and a form of figurative language through which people express their feelings, for example on social media or when they encounter a funny event. Researchers have found that humor enhances human health and mood (Hernández-Farías et al., 2015; Mao and Liu, 2019; Reyes et al., 2012; Yan and Pedersen, 2017).
In recent decades, the world has witnessed rapid progress in the Artificial Intelligence (AI) field toward simulating human intelligence in machines and computer systems (Wen et al., 2019; Raja and S, 2019). One of the main branches of AI is Natural Language Processing (NLP), in which computers learn to understand human languages despite their diversity and complexity, which remain open challenges for NLP systems and communities (Abdullah and Shaikh, 2018; Zhou et al., 2020). Since humor has many facets and can be produced through words (text), gestures (vision), and prosodic cues (acoustics) (Hasan et al., 2019), humor detection is considered a multimodal problem. The difficulty of detecting humor stems from (1) the nature of context, emotion, and rhythm, which makes it highly complex for a machine to gauge the level of humor, and (2) cultural differences: although humor is universal, different cultures perceive it in different ways (Martin et al., 1993).
Detecting humor in social media posts or TV dialogues using various algorithms has been studied thoroughly in the NLP field (Glazkova et al., 2019; Bertero and Fung, 2016). In this paper, we detect humor in news headlines from the Reddit website 1 provided by the SemEval-2020 organizers (Hossain et al., 2020). The International Workshop on Semantic Evaluation SemEval-2020 has introduced several NLP shared tasks. Task 7, Assessing the Funniness of Edited News Headlines, aims to motivate participants to build systems that predict the level of funniness in edited news headlines, where a single word in each headline is replaced through a micro-edit that may or may not make the headline funnier. The task consists of two sub-tasks (sub-task1 and sub-task2). In sub-task1, participants predict the funniness score of an edited headline, evaluated by RMSE; in sub-task2, they predict which of two edited headlines is funnier, evaluated by accuracy. Further details about Task 7 and the dataset are given in Section 3. Our team, MLEngineer, proposed the BFHumor model for predicting the funniness value of edited news headlines in both sub-tasks. Our model is an ensemble of different state-of-the-art pre-trained models combined with various NLP techniques. MLEngineer ranked 4th in sub-task1 and 12th in sub-task2 out of 84 teams.
The remaining sections of this paper are organized as follows: Section 2 describes existing work on detecting humor. Section 3 provides insights into the proposed system methodology and architecture. Section 4 presents the key experimental setup and BFHumor results. Finally, in Section 5, we conclude with the findings of our research.

Related work
A plethora of research has studied figurative language in various datasets to predict funniness values, and publications in this area are growing at a fast pace among industry and academic researchers. This section presents an analysis of existing work related to humor, irony, ridicule, and satire detection, and to assessing funniness.
The authors in (Mao and Liu, 2019; Ortega-Bueno et al., 2018; Farzin et al., 2019; Castro et al., 2018; Weller and Seppi, 2019; Garain, 2019) studied humor detection in various languages, classifying whether each tweet is a joke and how funny it is. In (Mao and Liu, 2019), the authors classified Spanish-language tweets obtained from Twitter as jokes or not, while Bueno et al. (Ortega-Bueno et al., 2018) predicted the level of funniness in tweets (a score based on the average of 5-star ratings).
Several machine learning techniques have been used to predict the funniness value. In (Garain, 2019), the researchers proposed a model for the HAHA-2019 task using deep learning algorithms such as Bidirectional LSTM (BiLSTM) and LSTM, and the researchers in (Blinov et al., 2019) also addressed humor recognition.

In this paper, we have built a robust model, named BFHumor, that deals with continuous labels and aims to predict the funniness value of a news headline between zero and three. This model is a combination of various pre-trained models and techniques, mainly the BERT regressor and the Flair NLP library. To the best of our knowledge, we are the first to use this combination of state-of-the-art techniques for humor detection.

System Methodology
In this section, we present the workflow of our proposed system for predicting the funniness of edited news headlines, as shown in Figure 1. More details about the workflow are given in the following subsections.

Input Data
The dataset used in this paper is obtained from Task 7 of SemEval-2020. The organizers of this task (Hossain et al., 2019; Hossain et al., 2020) collected the original news headlines from a popular website, Reddit. This dataset, called Humicroedit, was reduced to 15,095 headlines for sub-task1 and 14,696 headlines for sub-task2, and is divided into training, development, and testing sets for the two sub-tasks, as shown in Table 1. Some examples of original and edited news headlines appear in Table 2.

Data Pre-processing and Cleaning
The preprocessing and cleaning procedure consists of two critical steps: (1) Word-Replacement and (2) Data Pre-processing. In the first step, we replaced the word enclosed between < and /> with its substitute word, such as replacing the word Vice with the word school in ("Trump was told weeks ago that Flynn misled <Vice/>President"), which becomes ("Trump was told weeks ago that Flynn misled school President").
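The Word-Replacement step can be sketched with a small regular expression over the Humicroedit <word/> markup (the helper name is ours, not the authors'):

```python
import re

def apply_edit(headline: str, substitute: str) -> str:
    """Replace the <word/>-marked span in a Humicroedit headline
    with the substitute word (illustrative helper, not from the paper)."""
    return re.sub(r"\s*<[^>]+/>\s*", f" {substitute} ", headline).strip()

edited = apply_edit("Trump was told weeks ago that Flynn misled <Vice/>President", "school")
# edited == "Trump was told weeks ago that Flynn misled school President"
```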
In the second step, we experimented with several preprocessing packages; however, the best model results came from using the original data with a small amount of preprocessing. We applied existing preprocessing packages for cleaning the data, such as ekphrasis 2 , spaCy 3 , and clean-text. 4 We tokenized each headline by splitting it into chunks of words. Then, we applied spell correction on those words, unpacked contractions (can't to can not), and unpacked hashtags by applying word segmentation (#MeToo to "<hashtag>Me Too </hashtag>"). Also, we converted the text to lower case and deleted stop words using the spaCy package. Finally, we removed numbers, currency symbols (e.g., $, £), and punctuation marks (e.g., ?!:;()[]#@), and applied lowercasing with the clean-text package.
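As a rough, standard-library-only approximation of these cleaning steps (the actual pipeline relies on ekphrasis, spaCy, and clean-text, which also handle spell correction, word segmentation, and stop-word removal):

```python
import re

# A tiny contraction map for illustration; ekphrasis covers many more cases.
CONTRACTIONS = {"can't": "can not", "won't": "will not"}

def clean_headline(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[0-9$£]", "", text)           # numbers and currency symbols
    text = re.sub(r"[?!:;()\[\]#@.,]", "", text)  # punctuation marks
    return re.sub(r"\s+", " ", text).strip()
```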

Building BFHumor Model
In this subsection, we describe our proposed model, the BERT-Flair based Humor Detection Model (BFHumor). The BFHumor model is built from two main components: the BERT regressor and the Flair library, as shown in Figure 2.
Sub-model1: we have applied BERT regressor models from the bert-sklearn library with four types (bert-base-cased, bert-base-uncased, bert-large-cased, and bert-base-cased-finetuned-mrpc) 5 (Wolf et al., 2019). Sub-model1 takes the training dataset as input and predicts the corresponding mean grade (label) for the testing dataset. We trained our regressor with a training batch size of 16 and an evaluation batch size of 8 for 2 epochs. The learning rate is 3e-5, the validation fraction is 0.0, the gradient accumulation steps are 1, and the maximum sequence length is 64.
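The Sub-model1 configuration can be summarized as follows; the parameter names mirror the bert-sklearn BertRegressor interface, and the commented usage is a sketch rather than the authors' exact code:

```python
# Hyper-parameters of the BERT regressor (Sub-model1), as reported above.
BERT_PARAMS = dict(
    bert_model="bert-base-cased",  # also tried: bert-base-uncased,
                                   # bert-large-cased, bert-base-cased-finetuned-mrpc
    epochs=2,
    learning_rate=3e-5,
    train_batch_size=16,
    eval_batch_size=8,
    validation_fraction=0.0,
    gradient_accumulation_steps=1,
    max_seq_length=64,
)

# Usage sketch (requires the bert-sklearn package):
# from bert_sklearn import BertRegressor
# regressor = BertRegressor(**BERT_PARAMS)
# regressor.fit(train_headlines, train_mean_grades)
# predictions = regressor.predict(test_headlines)
```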
Sub-model2: we have used the Flair NLP library, which (1) applies several NLP models to text (e.g., named entity recognition), (2) supports multiple languages (e.g., German and French), and (3) provides an interface for extracting word embeddings (Akbik et al., 2019). We have used a RoBERTa type 6 as the underlying Flair embedding, specifically roberta-large-mnli, to extract word embeddings from the training and testing datasets, and then fed these embeddings to the machine learning algorithms. Sub-model2 details: 24 layers, a hidden size of 1024, 16 attention heads, and 355 million parameters.

Figure 2: The architecture of the BFHumor model.
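Sub-model2's embedding extraction can be sketched as below; the class names follow Flair's transformer-embedding API and are our reconstruction, not the authors' code:

```python
# The Flair embedding model used by Sub-model2 (RoBERTa: 24 layers,
# hidden size 1024, 16 heads, ~355M parameters).
FLAIR_MODEL = "roberta-large-mnli"

# Usage sketch (requires the flair package):
# from flair.data import Sentence
# from flair.embeddings import TransformerWordEmbeddings
# embedder = TransformerWordEmbeddings(FLAIR_MODEL)
# sentence = Sentence("Trump was told weeks ago that Flynn misled school President")
# embedder.embed(sentence)
# vectors = [token.embedding for token in sentence]  # one vector per token
```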
Finally, to predict the funniness values of the testing data in the two sub-tasks, the Naïve Bayes regressor (Mayo and Frank, 2020) reads these embeddings as input and returns its predictions. We then ensembled the predictions of BERT and Naïve Bayes using weights tuned over several experiments to obtain the best results in both sub-tasks.
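The ensembling step amounts to a weighted average of the two sub-models' outputs; the weights below are illustrative placeholders, since the paper tuned them empirically:

```python
def ensemble(bert_preds, nb_preds, w_bert=0.6, w_nb=0.4):
    """Weighted average of the BERT and Naive Bayes predictions.
    The weights are illustrative, not the tuned values from the paper."""
    return [w_bert * b + w_nb * n for b, n in zip(bert_preds, nb_preds)]
```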

Sub-task1
Several experiments were conducted for sub-task1 using: the BERT regressor with the two types bert-base-cased and bert-base-uncased, the XLNet pre-trained model (Yang et al., 2019), a Recurrent Neural Network (RNN), and NB (Mayo and Frank, 2020; De Mulder et al., 2015). More details about the models are shown in Table 3.
Firstly, we applied bert-base-cased and bert-base-uncased as underlying BERT types (Devlin et al., 2018), and the XLNet pre-trained model (Yang et al., 2019; Wolf et al., 2019), with different hyper-parameters, using Anaconda software 7 and Google Colab. 8 The BERT and XLNet model sizes were as follows: number of layers = 12, hidden size = 768, number of self-attention heads = 12, and total parameters = 110 million. The RMSE results are 0.55974, 0.62023, and 0.57896, respectively.
Secondly, the architecture of the RNN contains input, hidden, and output layers (De Mulder et al., 2015). The input layer uses word2vec word embeddings of size 300. These are followed by two Long Short-Term Memory (LSTM) layers (256 units, recurrent dropout of 0.2, dropout of 0.1, with return sequences set to true in the first layer and false in the second), two dense layers (256 units with ReLU activation), and two dropout layers (dropout of 0.3). The output layer is a dense layer with one unit and a sigmoid activation function. The loss function is mean squared error, the optimizer is Adam, the metric is mean squared error, the batch size is 52, and the number of epochs is 200. Based on the above parameters, the RMSE value obtained is 0.57855.
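The RNN above can be reconstructed in Keras from the stated hyper-parameters (a sketch of our reading of the description, not the authors' code):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input

# Reconstruction of the described RNN; training used batch_size=52 for 200 epochs.
model = Sequential([
    Input(shape=(None, 300)),  # word2vec embeddings of size 300
    LSTM(256, dropout=0.1, recurrent_dropout=0.2, return_sequences=True),
    LSTM(256, dropout=0.1, recurrent_dropout=0.2, return_sequences=False),
    Dense(256, activation="relu"),
    Dropout(0.3),
    Dense(256, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),  # single-unit output, as described
])
model.compile(loss="mse", optimizer="adam", metrics=["mse"])
```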
We extracted 153 features using the AffectiveTweets package (Bravo-Marquez et al., 2019) in the Weka tool 9 along with Python scripts. The Weka features are extracted using the Embedding, Input Lexicon, and SentiStrength functions, while the Python features are Jaccard similarity, cosine similarity, and complexity. We then fed them into NB, which gave an RMSE value of 0.57223.

Sub-task2

We also conducted several experiments for sub-task2, using the BERT regressor with bert-large-cased, ELMo (Peters et al., 2018), and roberta-large-mnli as a RoBERTa type from the Flair library. Firstly, we applied bert-large-cased as the underlying BERT type (Devlin et al., 2018) in Anaconda software. The model sizes for this type are as follows: number of layers = 24, hidden size = 1024, number of self-attention heads = 16, and total parameters = 340 million. The accuracy result is 0.55741.
Secondly, we used the roberta-large-mnli and ELMo models and fed their word embeddings into an NB algorithm, obtaining accuracy values of 0.58980 and 0.54931, respectively. The hyper-parameters, numbers of dimensions, algorithms, and accuracy values for the three models above are shown in Table 4.

BFHumor Results
In both sub-tasks, we use the same general structure with only one difference, the output phase (see Figure 2). Each sub-task passes the edited headlines through two different models, which produce two predictions that are then merged.
We have noticed the following findings from our experiments:
1. Using two epochs in any of the BERT regressor model types gives better predictions than using more epochs; only rarely did a single epoch give the best results.
2. Splitting the data into 0.8 for training and 0.2 for testing using the hold-out method in the NB regressor gives the best results.
3. Using the roberta-large-mnli word embeddings with NB gives the best results.
4. Feeding the entire training dataset, cleaned with the clean-text package, to the bert-base-uncased type as input gives the best results.
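The hold-out split from finding 2 can be reproduced with scikit-learn (the random_state is an illustrative choice, not from the paper):

```python
from sklearn.model_selection import train_test_split

# 0.8 / 0.2 hold-out split used for the NB regressor (finding 2).
X = [[float(i)] for i in range(10)]  # stand-in for headline embeddings
y = [0.5 * i for i in range(10)]     # stand-in for mean-grade labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```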
The MLEngineer team has participated in Task 7 of SemEval-2020 using the BFHumor model, achieving 4th rank out of 49 participants in sub-task1 based on the RMSE metric and 12th rank out of 32 participants in sub-task2 based on the accuracy metric, as shown in Table 5.

Conclusion
In this paper, we have described our participation in Task 7 of SemEval-2020 as the MLEngineer team. We presented our novel model, BFHumor, a BERT-Flair based humor detection model for predicting the funniness values of edited headlines in both sub-tasks. BFHumor is distinctive in that it combines the BERT regressor and the Flair library and uses the same underlying architecture in both sub-tasks.
In the BFHumor model, we selected bert-base-cased, bert-base-uncased, bert-large-cased, and bert-base-cased-finetuned-mrpc as the underlying BERT regressor models, and chose roberta-large-mnli as the underlying Flair type. We then merged the prediction results of the two sub-models.
The BFHumor model outperformed the baseline system in the competition, indicating that it is a promising model for detecting humor in text. Based on the RMSE and accuracy values, we were among the top 4 teams in sub-task1 with an RMSE value of 0.51966, which is 0.02 away from the first-ranked model, and we ranked 12th in sub-task2 with an accuracy of 0.62291, which is 0.05 away from the first-ranked model.