Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humor Recognition

Humor recognition has been widely studied as a text classification problem using data-driven approaches. However, most existing work does not examine the actual joke mechanism to understand humor. We break down a joke into two distinct components, the set-up and the punchline, and explore the special relationship between them. Inspired by the incongruity theory of humor, we model the set-up as the part that develops semantic uncertainty, and the punchline as the part that disrupts audience expectations. Leveraging increasingly powerful language models, we feed the set-up along with the punchline into the GPT-2 language model and calculate the uncertainty and surprisal values of the jokes. Conducting experiments on the SemEval 2021 Task 7 dataset, we found that these two features distinguish jokes from non-jokes better than existing baselines.


Introduction
One of the important aspects of computational humor is to develop computer programs capable of recognizing humor in text. Early work on humor recognition (Mihalcea and Strapparava, 2005) proposed heuristic-based, humor-specific stylistic features, such as alliteration, antonymy, and adult slang. More recent work (Yang et al., 2015; Chen and Soo, 2018; Weller and Seppi, 2019) regarded the problem as a text classification task and adopted statistical machine learning methods and neural networks to train models on humor datasets. However, only a few of the deep learning methods have tried to establish a connection between humor recognition and humor theories. Thus, one research direction in humor recognition is to bridge the disciplines of linguistics and artificial intelligence.
In this paper, we restrict the subject of investigation to jokes, one of the most common humor types in text form.

Set-up: Today my neighbor knocked at my door at 3am. Can you believe that? I was so pissed off.
Punchline: Lucky for him that I was awake playing the drums!
Figure 1: An example of a two-line joke consisting of a set-up and a punchline.

As shown in Figure 1, these jokes usually consist of a set-up and a punchline. The set-up creates a situation that introduces the hearer into the story framework, and the punchline concludes the joke in a succinct way, intended to make the hearer laugh. Perhaps the most suitable humor theory for explaining this phenomenon is the incongruity theory, which states that the cause of laughter is the perception of something incongruous (the punchline) that violates the hearer's expectation (established by the set-up).
Based on the incongruity theory, we propose two features for humor recognition, calculated from the degree of incongruity between the set-up and the punchline. Recently popular pre-trained language models enable us to study such a relationship based on large-scale corpora. Specifically, we fed the set-up along with the punchline into the GPT-2 language model (Radford et al., 2019), and obtained the surprisal and uncertainty values of the joke, indicating how surprising it is for the model to generate the punchline, and how uncertain the model is while generating it. We conducted experiments on a manually labeled humor dataset, and the results showed that these two features distinguish jokes from non-jokes better than existing baselines. Our work makes an attempt to bridge humor theories and humor recognition by applying large-scale pre-trained language models, and we hope it will inspire future research in computational humor.


Humor Recognition
Most of the existing work on humor recognition in text focuses on one-liners, a type of joke that delivers the laughter in a single line. The methodologies typically fall into two categories: feature engineering and deep learning. Mihalcea and Strapparava (2005) designed three human-centric features (alliteration, antonymy, and adult slang) for recognizing humor in a curated one-liner dataset. Mihalcea et al. (2010) approached the problem by calculating the semantic relatedness between the set-up and the punchline (they evaluated 150 one-liners by manually splitting them into "set-up" and "punchline"). Shahaf et al. (2015) investigated funny captions for cartoons and proposed several features, including perplexity, to distinguish between funny and less funny captions. Morales and Zhai (2017) proposed a probabilistic model and leveraged background text sources (such as Wikipedia) to identify humorous Yelp reviews. Liu et al.
(2018) proposed to model sentiment association between elementary discourse units and designed features based on discourse relations. Cattle and Ma (2018) explored the usage of word associations as a semantic relatedness feature in a binary humor classification task. As neural networks have become popular in recent years, several deep learning architectures have been developed for recognizing humor in text. Chen and Lee (2017) and Chen and Soo (2018) adopted convolutional neural networks, while Weller and Seppi (2019) used a Transformer architecture for the classification task. Fan et al. (2020) incorporated extra phonetic and semantic (ambiguity) information into the deep learning framework. In addition to these methodological papers, several shared tasks have been dedicated to computational humor in recent years. SemEval 2020 Task 7 (Hossain et al., 2020) aimed at assessing humor in edited news headlines. SemEval 2021 Task 7 (Meaney et al., 2021) involved predicting the humor rating of a given text, and whether the rating is controversial. In this task, Xie et al. (2021) adopted the DeBERTa architecture (He et al., 2020) with a disentangled attention mechanism to predict the humor labels.
Although the work of Mihalcea et al. (2010) is the closest to ours, we are the first to bridge the incongruity theory of humor and large-scale pre-trained language models. Other work (Bertero and Fung, 2016) has attempted to predict punchlines in conversations extracted from TV series, but their subject of investigation is inherently different from ours: punchlines in conversations largely depend on the preceding utterances, while jokes are much more succinct and self-contained.

Humor Theories
Attempts to explain humor date back to ancient Greece, where philosophers like Plato and Aristotle regarded the enjoyment of comedy as a form of scorn and held critical opinions towards laughter. These philosophical comments on humor were summarized as the superiority theory, which states that laughter expresses a feeling of superiority over other people's misfortunes or shortcomings. Starting from the 18th century, two other humor theories began to challenge the dominance of the superiority theory: the relief theory and the incongruity theory. The relief theory argues that laughter serves to facilitate the relief of pressure for the nervous system (Morreall, 2020). This explains why laughter is caused when people recognize taboo subjects: one typical example is the wide usage of sexual terms in jokes. The incongruity theory, supported by Kant (1790), Schopenhauer (1883), and many later philosophers and psychologists, states that laughter comes from the perception of something incongruous that violates expectations. This view of humor fits well with the types of jokes commonly found in stand-up comedy, where the set-up establishes an expectation and the punchline violates it. As an expansion of the incongruity theory, Raskin (1979) proposed the Semantic Script-based Theory of Humor (SSTH) by applying semantic script theory. It posits that, in order to produce verbal humor, two requirements should be fulfilled: (1) the text is compatible with two different scripts; (2) the two scripts with which the text is compatible are opposite.

Methodology
The incongruity theory attributes humor to the violation of expectation. This means the punchline delivers the incongruity that overturns the expectation established by the set-up, making it possible to interpret the set-up in a completely different way. With neural networks blooming in recent years, pre-trained language models make it possible to study such a relationship between the set-up and the punchline based on large-scale corpora. Given the set-up, language models are capable of writing expected continuations, enabling us to measure the degree of incongruity by comparing the actual punchline with what the language model is likely to generate.
In this paper, we leverage the GPT-2 language model (Radford et al., 2019), a Transformer-based architecture trained on the WebText dataset. We chose GPT-2 because: (1) it is already pre-trained on massive data and publicly available, which spares us the training process; (2) it is domain-independent, and thus suitable for modeling various styles of English text. Our goal is to model the set-up and the punchline as a whole piece of text using GPT-2, and to analyze the probability of generating the punchline given the set-up. In the following, we denote the set-up as x and the punchline as y. We are interested in two quantities regarding the probability distribution p(y|x): uncertainty and surprisal, which are elaborated in the next two sections.

Uncertainty
The first question we are interested in is: given the set-up, how uncertain is the language model about what comes next? This question is related to SSTH, which states that, for a piece of text to be humorous, it should be compatible with two different scripts. Under the framework of set-up and punchline, this means the set-up can be interpreted in multiple ways, depending on the punchline that follows. Thus, one would expect a higher uncertainty value when the language model tries to continue the set-up and generate the punchline.

Figure 2: GPT-2 takes the set-up tokens x_1, ..., x_m followed by the punchline tokens y_1, ..., y_{n-1} as input, and outputs probability distributions v_1, ..., v_n for predicting the punchline tokens y_1, ..., y_n.

We propose to calculate the averaged entropy of the probability distributions at all token positions of the punchline to represent the degree of uncertainty. As shown in Figure 2, the set-up x and the punchline y are concatenated and then fed into GPT-2 to predict the next token. While predicting the i-th token of y, GPT-2 produces a probability distribution v_i over the vocabulary. The averaged entropy is then defined as

U(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{w \in V} v_i^w \log v_i^w,    (1)

where V is the vocabulary and v_i^w denotes the probability that v_i assigns to token w.
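As a minimal sketch (not the paper's own code), the averaged entropy can be computed from the model's next-token distributions alone; here we assume they have already been extracted into an (n × |V|) matrix, e.g. by softmaxing GPT-2's logits at the punchline positions:

```python
import numpy as np

def average_entropy(punchline_probs):
    """Averaged entropy over the punchline positions (the uncertainty feature).

    punchline_probs: shape (n, |V|); row i is the next-token distribution
    v_i over the vocabulary V when predicting the i-th punchline token.
    """
    p = np.asarray(punchline_probs, dtype=float)
    # log v_i^w, replacing log 0 by 0 so that 0 * log 0 contributes nothing
    logs = np.log(np.where(p > 0, p, 1.0))
    entropies = -(p * logs).sum(axis=1)  # entropy of each v_i
    return entropies.mean()              # average over the n positions
```

For a uniform distribution over four tokens, for instance, each position contributes entropy log 4 ≈ 1.386, the maximum for that vocabulary size.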

Surprisal
The second question we would like to address is: how surprising is it when the language model actually generates the punchline? As the incongruity theory states, laughter is caused when something incongruous is observed that violates the previously established expectation. Therefore, we expect the probability of the language model generating the actual punchline to be relatively low, i.e., the surprisal value should be high. Formally, the surprisal is defined as

S(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \log v_i^{y_i},    (2)

where v_i^{y_i} denotes the probability that v_i assigns to the actual punchline token y_i.
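Analogously, the average surprisal needs only the per-position distributions and the actual punchline token ids; the function name and input format below are our own conventions, not the paper's code:

```python
import numpy as np

def average_surprisal(punchline_probs, punchline_ids):
    """Average negative log-probability of the actual punchline tokens.

    punchline_probs: shape (n, |V|), row i being the distribution v_i;
    punchline_ids: length-n sequence of vocabulary indices of y_1 .. y_n.
    """
    p = np.asarray(punchline_probs, dtype=float)
    ids = np.asarray(punchline_ids)
    # Probability the model assigned to each actual punchline token y_i
    token_probs = p[np.arange(len(ids)), ids]
    return -np.log(token_probs).mean()
```

An unexpected punchline receives low per-token probabilities, so its average surprisal is high; a predictable continuation scores low.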

Experiments
We evaluated and compared the proposed features with several baselines by conducting experiments in two settings: predicting using individual features, and combining the features with a content-based text classifier.

Baselines
Similar to our approach of analyzing the relationship between the set-up and the punchline, Mihalcea et al. (2010) proposed to calculate the semantic relatedness between the set-up and the punchline. The intuition is that the punchline (which delivers the surprise) will have a minimum relatedness to the set-up. For our experiments, we chose the two relatedness metrics that perform best in their paper as our baselines, plus another similarity metric based on shortest paths in WordNet (Miller, 1995):
• Leacock & Chodorow similarity (Leacock and Chodorow, 1998), defined as

sim_{LC}(C_1, C_2) = -\log \frac{length(C_1, C_2)}{2D},

where length(C_1, C_2) is the length of the shortest path between the two concepts using node-counting, and D is the maximum depth of WordNet.
• Wu & Palmer similarity (Wu and Palmer, 1994) calculates similarity by considering the depths of the two synsets in WordNet, along with the depth of their LCS (Least Common Subsumer):

sim_{WP}(C_1, C_2) = \frac{2 \cdot depth(LCS(C_1, C_2))}{depth(C_1) + depth(C_2)},

where C_1 and C_2 denote synset 1 and synset 2, respectively.
• Path similarity (Rada et al., 1989) is also based on the length of the shortest path between two concepts in WordNet:

sim_{path}(C_1, C_2) = \frac{1}{length(C_1, C_2)}.

In addition to the metrics mentioned above, we also consider the following two baselines related to the phonetic and semantic styles of the input text:
• Alliteration. The alliteration value is computed as the total number of alliteration chains and rhyme chains found in the input text (Mihalcea and Strapparava, 2005).
• Ambiguity. Semantic ambiguity is found to be a crucial component of humor (Miller and Gurevych, 2015). We follow the work of Liu et al. (2018) and compute the ambiguity value as

\log \prod_{w \in s} num\_of\_senses(w),

where w is a word in the input text s.
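Since these baselines reduce to simple arithmetic over WordNet path lengths, synset depths, and sense counts, they can be sketched as pure functions. The function names and argument conventions are ours; in practice the path lengths, depths, and sense counts would come from a WordNet toolkit such as NLTK, and the product form inside the ambiguity logarithm is an assumption:

```python
import math

def leacock_chodorow(path_length, max_depth):
    # sim_LC(C1, C2) = -log(length / (2 * D)), node-counting path length
    return -math.log(path_length / (2.0 * max_depth))

def wu_palmer(depth1, depth2, depth_lcs):
    # sim_WP(C1, C2) = 2 * depth(LCS) / (depth(C1) + depth(C2))
    return 2.0 * depth_lcs / (depth1 + depth2)

def path_similarity(path_length):
    # Inversely proportional to the shortest-path length between concepts
    return 1.0 / path_length

def ambiguity(sense_counts):
    # log of the product of per-word sense counts (assumed product form),
    # i.e. the sum of the logs of each word's number of WordNet senses
    return math.log(math.prod(sense_counts))
```

For the set-up/punchline baselines, these scores would be aggregated over word pairs drawn from the two parts of the joke.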

Dataset
We took the dataset from SemEval 2021 Task 7 (Meaney et al., 2021) and applied several filtering steps. (2) We only kept punchlines whose percentage of alphabetical letters is greater than or equal to 75%; (3) we discarded punchlines that do not begin with an alphabetical letter. As a result, we obtained 3,341 examples in total, consisting of 1,815 jokes and 1,526 non-jokes. To further balance the data, we randomly selected 1,526 jokes, so the final dataset contains 3,052 labeled examples in total. For the following experiments, we used 10-fold cross-validation and report the averaged scores.
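The two punchline filters above can be sketched as follows; exactly how the "percentage of alphabetical letters" treats whitespace and punctuation is an assumption on our part (here, all characters count toward the denominator):

```python
def alpha_ratio(text):
    """Fraction of all characters (including spaces) that are letters."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)

def keep_punchline(punchline):
    # (2) at least 75% alphabetical letters; (3) begins with a letter
    return alpha_ratio(punchline) >= 0.75 and punchline[:1].isalpha()
```

A punchline consisting mostly of digits or symbols, or one opening with punctuation, would thus be discarded.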

Predicting Using Individual Features
To test the effectiveness of our features in distinguishing jokes from non-jokes, we built an SVM classifier (parameters can be found in Appendix A) for each individual feature (uncertainty and surprisal, plus the baselines). The resulting scores are reported in Table 1. Compared with the baselines, both of our features (uncertainty and surprisal) achieved higher scores on all four metrics.
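A hedged sketch of the single-feature evaluation, assuming scikit-learn and the SVM settings from Appendix A; the helper name and the use of scikit-learn's default stratified splitting are our own choices:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def single_feature_cv_score(feature_values, labels, folds=10):
    """Mean cross-validated accuracy of an RBF SVM on one scalar feature."""
    X = np.asarray(feature_values, dtype=float).reshape(-1, 1)
    y = np.asarray(labels)
    # C = 1.0 and gamma = 1 / n_features, matching Appendix A
    clf = SVC(C=1.0, kernel="rbf", gamma="auto")
    return cross_val_score(clf, X, y, cv=folds).mean()
```

The same routine can be run once per feature (uncertainty, surprisal, and each baseline) to fill a results table like Table 1.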
In addition, we also tested the performance of uncertainty combined with surprisal (last row of Table 1), and the resulting classifier shows a further increase in performance. This suggests that, by jointly considering the uncertainty and surprisal of the set-up and the punchline, we can better recognize jokes.

Boosting a Content-Based Classifier
Now that we have shown the advantage of our features when used individually for prediction, we would like to validate their effectiveness when combined with commonly used word embeddings. Thus, we evaluated our features as well as the baselines within the framework of a content-based classifier. The idea is to see whether the features can further boost the performance of existing text classifiers.
To create a starting point, we encoded each set-up and punchline into vector representations by aggregating the GloVe (Pennington et al., 2014) embeddings of their tokens (summing them up and then normalizing by the sentence length). We used GloVe embeddings of dimension 50, and concatenated the set-up vector and the punchline vector to represent the whole piece of text as a vector of dimension 100. For each of the features (uncertainty and surprisal, plus the baselines), we appended it to the GloVe vector and built an SVM classifier to do the prediction. Scores are reported in Table 2. As we can see, compared with the baselines, our features produce larger increases in the performance of the content-based classifier, and, similar to what we observed in Table 1, jointly considering uncertainty and surprisal yields a further increase in performance.
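The aggregation described above can be sketched as follows, using a toy embedding dictionary in place of the real 50-dimensional GloVe vectors; whether out-of-vocabulary tokens count toward the normalizing length is an assumption (here they do):

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=50):
    """Sum the tokens' embeddings, then normalize by the sentence length."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.sum(vecs, axis=0) / len(tokens)

def joke_vector(setup_tokens, punchline_tokens, embeddings, dim=50):
    # Concatenate the set-up and punchline vectors: dimension 2 * dim
    return np.concatenate([
        sentence_vector(setup_tokens, embeddings, dim),
        sentence_vector(punchline_tokens, embeddings, dim),
    ])
```

Each scalar feature is then appended to this 2 × dim vector before training the SVM.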

Visualizing Uncertainty and Surprisal
To get an intuitive view of the uncertainty and surprisal values for jokes versus non-jokes, we plot their histograms in Figure 3 (for all 3,052 labeled examples). It can be observed that, for both uncertainty and surprisal, jokes tend to have higher values than non-jokes, which is consistent with our expectations in Section 4.

Conclusion
This paper makes an attempt at establishing a connection between humor theories and today's popular pre-trained language models. We proposed two features based on the incongruity theory of humor: uncertainty and surprisal. We conducted experiments on a humor dataset, and the results suggest that our approach has an advantage in humor recognition over the baselines. The proposed features can also provide insight for the task of two-line joke generation: when designing the text generation algorithm, one could impose extra constraints so that the set-up is compatible with multiple possible interpretations, and the punchline is surprising in a way that violates the most obvious interpretation. We hope our work will inspire future research in the community of computational humor.

A Model Parameters
For the SVM classifier, we set the regularization parameter C = 1.0 and used the RBF kernel with kernel coefficient γ = 1/n_features, where n_features is the number of input features. All models were trained and evaluated on a machine with an Intel Core i7-6700K CPU, an Nvidia GeForce GTX 1080 GPU, and 16 GB RAM. The running time of each method is listed in Table 3 and Table 4.
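In scikit-learn terms (assuming that implementation, which the paper does not name), this configuration corresponds to:

```python
from sklearn.svm import SVC

# C = 1.0, RBF kernel; in scikit-learn, gamma="auto" means 1 / n_features
clf = SVC(C=1.0, kernel="rbf", gamma="auto")
```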