JokeMeter at SemEval-2020 Task 7: Convolutional Humor

This paper describes our system designed for humor evaluation within SemEval-2020 Task 7. The system is based on a convolutional neural network architecture. We evaluate the system on the official dataset, and we provide more insight into the model itself to see what the learned inner features look like.


Introduction
This paper deals with estimating the humor of edited English news headlines (Hossain et al., 2020b; Hossain et al., 2020a). An illustration of the tasks is in Figure 1. We are given the original text sequence representing a title, with an annotated part that is edited, along with the edit itself. Our task is to determine how funny this change is on a scale from 0 to 3 (inclusive). This is called Sub-Task 1. We also participate in Sub-Task 2, in which we must decide which of two given edits is the funnier one. For the second task, we reuse the model from the first task, as described in section 4, so we focus our description on Sub-Task 1. Official results were achieved with a Convolutional Neural Network (CNN) (LeCun et al., 1999; Fukushima and Miyake, 1982), but we also tested numerous other approaches such as SVM (Cortes and Vapnik, 1995) and a pre-trained transformer model (Vaswani et al., 2017).
Humor is a very subjective phenomenon, as can be seen from the inter-annotator agreement on the Sub-Task 1 dataset. The given labels do not allow us to learn the sense of humor of an individual annotator, because the dataset does not specify which annotator a grade comes from. So, for example, if one annotator likes dark humor and all the others do not, we will consider such a replacement not humorous, no matter how excellent the dark humor is. In other words, we may say that we are searching for the most common kind of humor.
The dominant theory of humor is the Incongruity Theory (Morreall, 2016). It says that we find humor in perceiving something unexpected (incongruous) that violates the expectations set up by the joke. There are samples in the provided dataset that use incongruity to create humor. Moreover, according to Hossain et al. (2020a), we can see a positive influence of incongruity on system results for the dataset.

Related Work
Computational humor is usually divided into two groups: recognition and generation. Humor generation focuses on creating humor itself, e.g., a system that is able to tell a joke.
Recognition can be a binary (funny or not) classification task, but in our work, we also want to know how funny a sequence is by rating it with a grade in a given interval. Hossain et al. (2019), who introduced the dataset used in this work, tried to create models that classify whether a given edited title is funny or not, so they trained only binary classifiers, in contrast with our regression approach.
Our official results were achieved with a CNN model inspired by the architecture presented in Zhang and Wallace (2015). Their architecture is compact (a one-layer CNN), so it has the advantage of using fewer computational resources than big models like BERT. Even with its small size, it was able to achieve promising results on the sentence classification task. Such a small model also allows us to gain better insight into what is going on underneath.

Data
In this section, we point out some interesting facts about the data we used. We focus on the data for Sub-Task 1 because we reuse the model trained on Sub-Task 1 for Sub-Task 2 (more in section 4).
For each example, the dataset (Hossain et al., 2019) provides annotation in the form of humor grades (0, 1, 2, and 3), sorted in descending order, and the mean of these grades. In most cases, there are five grades per dataset sample (sometimes more); in our work, we always use the first five grades. As can be seen from the graphs in Figure 2, the dataset is imbalanced, and high grades are rare. We also investigated how imbalanced the dataset is when considering just a single n-th grade and omitting the others. Though the graph on the right side of the same figure shows that there is still imbalance, the 2nd and 3rd positions seem to have smaller fluctuation in the number of samples per grade than the other positions. We also did further analysis to determine the prediction quality of an oracle classifier that always predicts the grade at the n-th position. The results are in Table 1; the third position is clearly superior.
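The oracle analysis above can be sketched as follows. This is a toy illustration with invented grade lists, not the actual dataset: for each sample we take the five grades sorted in descending order, let the oracle always predict the grade at position n, and measure the RMSE against the mean grade.

```python
import math

# Hypothetical samples: five annotator grades sorted in descending order,
# paired with their mean, which serves as the regression target.
samples = [
    ([3, 2, 1, 1, 0], 1.4),
    ([2, 2, 1, 0, 0], 1.0),
    ([1, 1, 0, 0, 0], 0.4),
]

def oracle_rmse(samples, n):
    """RMSE of an oracle that always predicts the grade at position n (0-based)."""
    errors = [(grades[n] - mean) ** 2 for grades, mean in samples]
    return math.sqrt(sum(errors) / len(errors))

for n in range(5):
    print(f"position {n + 1}: RMSE = {oracle_rmse(samples, n):.3f}")
```

On such descending-sorted grades, middle positions tend to sit closest to the mean, which is consistent with the third position scoring best in Table 1.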
Another thing we decided to investigate is what the RMSE score would look like if we always predicted the same grade. The results are in Table 2.

System Description
The main inspiration for our model architecture (JokeMeter) comes from the work of Zhang and Wallace (2015). The model has the CNN architecture illustrated in Figure 3. First, the input sequence is assembled: the edit is inserted into the original title right after the part being edited, separated by a slash, and the whole original/edit location is delimited with two hashtags. In this way, the model receives input with complete information about both the original and the edited title. We also include tokens marking the start and the end of a title. The reason behind these tokens is that we want to encode into an n-gram whether it comes from the beginning or end of a title, and possibly make it easier for the model to learn setup-and-punchline humor. We tokenized the input with the ALBERT (Lan et al., 2019) pre-trained SentencePiece (Kudo and Richardson, 2018) tokenizer. Each token is assigned a 128-dimensional embedding from a vocabulary of 30,000 tokens. Right before the convolution, zero-padding of size one is added on both sides of the sequence. Each sequence is 512 tokens long; shorter sequences are padded with a special padding token.
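A minimal sketch of the input assembly step could look as follows. The exact marker symbols (here `<s>`, `</s>`, `#`, `/`) are illustrative assumptions; the paper only specifies that a slash separates the original part from the edit, two hashtags delimit the whole location, and special tokens mark the title boundaries.

```python
def assemble_input(title, span, edit):
    """Insert the edit after the edited part, separated by a slash, and
    delimit the whole original/edit location with two hashtags.
    `span` is the (start, end) character range of the edited part.
    Marker tokens are illustrative; the exact symbols may differ."""
    start, end = span
    original = title[start:end]
    return ("<s> " + title[:start]
            + "# " + original + " / " + edit + " #"
            + title[end:] + " </s>")

print(assemble_input("Cat wins the election", (0, 3), "Toddler"))
# -> "<s> # Cat / Toddler # wins the election </s>"
```

The assembled string would then be fed to the SentencePiece tokenizer and padded or truncated to the fixed length of 512 tokens.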
We used four convolution filter region sizes: 2, 3, 4, and 8, with two filters for each size. These filters are followed by a LeakyReLU (Maas et al., 2013) activation (negative slope 0.01). We also experimented with a model (JokeMeterBoosted) that uses 2048 filters for each size and 2048-dimensional embeddings and does not use the embedding of the edit.
In the final part of our model, we apply max pooling to obtain one feature per filter (8 in total). We concatenate these features into a vector, which is then concatenated with the edit embedding. The edit embedding is the average of the embeddings of all tokens the edit is composed of.
At the very end, we perform a linear transformation followed by a softmax to obtain probabilities for each grade. With that configuration alone we would not be doing a regression task, so at test time we perform one final calculation that transforms these probabilities into a grade from the continuous interval [0, 3]:

G(s) = Σ_{i=0}^{3} i · p_i(s),

where G(s) is the grade, s is the input sequence, and p_i(s) is the estimated probability of grade i. In the case of Sub-Task 2, we run the model on both titles separately and, in the end, make the decision by comparing their estimated grades.
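The conversion from grade probabilities to a continuous grade is just the expected value of the grade under the softmax distribution. A minimal, framework-free sketch:

```python
import math

def probabilities_to_grade(logits):
    """Convert the four grade logits into a continuous grade in [0, 3]
    as the expectation G(s) = sum_i i * p_i(s)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # p_0 .. p_3 via softmax
    return sum(i * p for i, p in enumerate(probs))

# Uniform logits give uniform probabilities and thus the midpoint grade.
print(probabilities_to_grade([0.0, 0.0, 0.0, 0.0]))  # -> 1.5
```

Note that this expectation can never reach exactly 0 or 3 for finite logits, which already biases predictions toward the middle of the interval.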

Convolutional features analysis
We used two filters for each region size because we expected that the model would learn one feature signaling funny and one feature signaling not funny. Nevertheless, our analysis shows that it instead learns features that signal only how unfunny the given n-gram is, as can be seen in Figure 4. To gain further insight into this property, we calculated, on the Sub-Task 1 train set, Spearman's rank correlation coefficient between the mean grade and each feature (after max pooling). The results in Table 3 show a negative correlation, which corresponds to our hypothesis. An interesting finding is that quite a few features have zero variance, which means that a constant was learned. This leads us to think that these features reflect the fact that a relatively good RMSE can be achieved by predicting a constant (e.g., one, as shown in Table 2) due to the imbalance in the dataset.

Experiments
In this section, we describe the experiments we performed, not just with the model described in section 4. Apart from models based on neural networks, we evaluated several baselines: a Decision Tree Classifier (DTC) (Breiman et al., 1984), SVM, k-NN, and a Naive Bayes Classifier (NBC). We also experimented with a model that uses the transformer architecture (ALBERT-base-v2). The neural models were implemented in PyTorch (Paszke et al., 2019). For ALBERT and the tokenizer, we used the Hugging Face (Wolf et al., 2019) implementations, and for the non-neural models, the implementations from scikit-learn (Pedregosa et al., 2011).
We did two kinds of experiments for all models. The first kind uses all five grades for each sample during training (all-grades training): every sample is copied five times, and one grade is assigned to each copy, so we obtain five samples with the same content but possibly different grades. The second kind of experiment uses only the 3rd grade (3-grade training), which gives the best score for the oracle classifier (see Table 1).

Table 4: Results on the test set for non-neural models. The number of neighbors for k-NN differs among the experiments: for Sub-Task 1 we use k = 5, and for Sub-Task 2, k = 13.

Table 5: Results on the test set for neural models. We also provide results of the baseline and the best model in the competition. Our official results are in the JM (all) column. The abbreviation JM means JokeMeter, JMB is JokeMeterBoosted, and JA is JokeALBERT; "all" means that the all-grades training was used and "3." that the 3-grade training was used.
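The two training-set constructions can be sketched as follows; the sample texts and grade lists are invented placeholders.

```python
# Hypothetical samples: (input text, five grades sorted in descending order).
samples = [
    ("headline A", [3, 2, 1, 1, 0]),
    ("headline B", [2, 1, 1, 0, 0]),
]

def all_grades_training(samples):
    """Copy every sample five times, assigning one grade to each copy."""
    return [(text, g) for text, grades in samples for g in grades[:5]]

def third_grade_training(samples):
    """Keep a single copy per sample, labeled with the 3rd grade (index 2)."""
    return [(text, grades[2]) for text, grades in samples]

print(len(all_grades_training(samples)))  # -> 10
print(third_grade_training(samples))      # -> [('headline A', 1), ('headline B', 1)]
```

Under all-grades training, identical inputs can carry different labels, which effectively trains the model toward the label distribution rather than a single grade.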

Non-neural models
These models perform classification, meaning the model must decide among the 4 grades instead of selecting a value from [0, 3]. TF-IDF word features are used for every model, and all models are trained on the train set. We show results for two types of experiments: when both the original sequence and the edit (the new word we are inserting into the title) are provided, and when we provide only the edit word. The results can be seen in Table 4. These models are not even able to match the results of the simple model that predicts a constant (see Table 2). Interestingly, comparable results can be achieved using just the edit word.
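Since the paper names scikit-learn as the implementation, the TF-IDF-plus-classifier setup can be sketched as below. The texts and grades are toy stand-ins, and the SVM hyperparameters are left at scikit-learn defaults rather than the ones used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for (edited title, grade) pairs; grades 0..3 are the classes.
texts = ["cat wins election", "senate passes nap bill",
         "stocks fall again", "moon declared cheese"]
grades = [2, 3, 0, 1]

# TF-IDF word features feeding a classifier that picks one of the 4 grades.
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(texts, grades)
print(model.predict(["cat wins election"]))
```

The edit-word-only variant would simply replace `texts` with the inserted words, leaving the pipeline unchanged.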

Neural models
In addition to the model used in our submission (and its JokeMeterBoosted variant), we performed experiments with a system using a pre-trained ALBERT model (JokeALBERT). JokeALBERT obtains contextual embeddings of the whole input sequence from ALBERT, selects those that belong to the edited word, and averages them into one. Finally, a linear transformation with softmax is applied.
We provided all our neural models with input in the same format (see section 4). For both models, we use the cross-entropy loss. We used Adam (Kingma and Ba, 2014) with weight decay (Loshchilov and Hutter, 2017) as the optimizer. We stop training after five consecutive epochs without an improvement in RMSE on the dev set. The results for these models are presented in Table 5. Results for JokeMeter and JokeALBERT were obtained with batch size 16 and learning rate 1e-5.
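The stopping criterion can be sketched as a small driver loop. The `train_epoch` and `eval_rmse` callbacks are hypothetical placeholders for the actual epoch training and dev-set evaluation code; the toy RMSE curve below is invented to show the behavior.

```python
import math

def train_with_early_stopping(train_epoch, eval_rmse, patience=5, max_epochs=100):
    """Stop after `patience` consecutive epochs without an improvement
    in dev-set RMSE; return the best RMSE observed."""
    best = math.inf
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch()
        rmse = eval_rmse()
        if rmse < best:
            best = rmse
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best

# Toy dev-set RMSE curve: improves for three epochs, then plateaus.
curve = iter([0.60, 0.58, 0.57] + [0.57] * 7)
print(train_with_early_stopping(lambda: None, lambda: next(curve)))  # -> 0.57
```

In practice one would also checkpoint the model weights whenever `best` improves, so the returned model corresponds to the best dev epoch rather than the last one.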

Evaluation and baseline
All results for trained models were evaluated with the official scripts. For Sub-Task 1, we always report the root-mean-square error (RMSE) metric, and for Sub-Task 2, accuracy is used.
The baseline system for Sub-Task 1 always predicts the mean funniness grade from the training set (0.936), and for Sub-Task 2, it always predicts the most frequent label in the training set (1).

Conclusion
We provided a description of our system and compared the results achieved by the official model with several other models, including the baseline and the best team in the competition.

Figure 5: Comparison of grade predictions for multiple models and the true labels. We can see that the predictions of JokeMeter and JokeALBERT are concentrated in a small interval around one. JokeMeter, JokeMeterBoosted, and JokeALBERT use all-grades training, and the SVM uses 3-grade training.
In future work, it should be further investigated whether the imbalanced dataset and the low inter-annotator agreement caused the JokeMeter model to focus more on the prior probabilities of the grades than on the input itself (see Figure 5).

A.1 Embeddings
Figure 6 shows the influence of the token embedding size on the RMSE. The rest of the JokeMeter model configuration remains the same as described in section 4.

Figure 6: Influence of the size of the token embedding on the RMSE.

A.2 Convolutional features
Figure 7 shows the influence of the number of convolutional filters per region size on the RMSE. The rest of the JokeMeter model configuration remains the same as described in section 4. As shown in Figure 8, the relation between token embedding size and RMSE differs for 2048 convolutional filters per region size from that for the default 2 (see Figure 6).

A.3 Ablation experiments
This section presents the results of ablation experiments for a JokeMeter variant that uses 2048 convolutional filters per region size and token embeddings of size 2048. The rest of the model configuration remains the same as described in section 4.
Table 6 summarizes the influence of the used features. We can see that the usage of the edit embedding does not improve the results. Based on these findings, we created a model that uses 2048 convolutional filters per region size, 2048-dimensional token embeddings, and no edit embedding; we call it JokeMeterBoosted.

Features                                      RMSE
convolutional features only                   0.5503 ± 0.0012
edit embedding only                           0.6352 ± 0.0005
convolutional features and edit embedding     0.5506 ± 0.0008

Table 6: Results for the ablation experiments on JokeMeterBoosted.

A.4 Batch size and learning rate analysis for JokeMeterBoosted
According to Figure 9, we used batch size 64 and learning rate 1e-5 for JokeMeterBoosted.