Duluth at SemEval-2020 Task 7: Using Surprise as a Key to Unlock Humorous Headlines

We use pretrained transformer-based language models in SemEval-2020 Task 7: Assessing the Funniness of Edited News Headlines. Inspired by the incongruity theory of humor, we use a contrastive approach to capture the surprise in the edited headlines. In the official evaluation, our system gets 0.531 RMSE in Subtask 1, 11th among 49 submissions. In Subtask 2, our system gets 0.632 accuracy, 9th among 32 submissions.


Introduction
Humor detection is a challenging problem in natural language processing. SemEval-2020 Task 7 (Hossain et al., 2020a) focuses on detecting humor in English news headlines with micro-edits. Specifically, each edited headline has one selected word or entity replaced by an editor, and the result is then graded for its degree of funniness. Accurate scoring of the funniness of micro-edits can serve as a foundation for humorous text generation (Hossain et al., 2020a).
Inspired by the incongruity theory (Veale, 2004; Morreall, 2016), we believe that contrast and surprise are key ingredients of humor. We instantiate this intuition with a contrastive framework. We then systematically compare three widely used models: CBOW, BERT (Devlin et al., 2019), and RoBERTa, providing a benchmark for this task. Our best system, based on RoBERTa, achieves competitive performance on both subtasks. Our code is available on GitHub.

Related Work
Early humor recognition systems are mostly based on traditional machine learning methods, such as support vector machines, decision trees, Naive Bayes, and k-nearest neighbors (Castro et al., 2016). In addition, an n-gram language model performs well at learning a sense of humor from tweets (Yan and Pedersen, 2017). Yet n-gram models are limited to a small number of context words.
Pretrained language models based on the Transformer (Vaswani et al., 2017) can capture contextual information from a whole sentence. Among this family, BERT has been used to assess the humor in tweets and jokes (Mao and Liu, 2019; Weller and Seppi, 2019). Motivated by these recent advances, we use BERT to judge the funniness of edited news headlines. We additionally experiment with RoBERTa, a robustly optimized variant of BERT.
Lastly, several works also attempt to explicitly model the incongruity and surprise of humorous text, focusing on homophonic puns. Kao et al. (2016) formalize incongruity as a mixed effect of ambiguity and distinctiveness, quantified by entropy and Kullback-Leibler divergence. He et al. (2019) propose a local-global surprisal measure based on the log-likelihood ratio to assess whether a sentence is a pun. In contrast, we focus on a broader definition of humor, and formulate incongruity as an input pair to a dual encoder framework.

Task Data
The Humicroedit dataset (Hossain et al., 2019) provides the training, development, and test data for this task. We also use additional training data from the FunLines dataset (Hossain et al., 2020b). The dataset statistics are summarized in Appendix A. In Subtask 1, the goal is to predict the funniness score of an edited headline. The score $z$ ranges from 0 to 3, where 0 means not funny and 3 means very funny. In Subtask 2, the goal is to predict which of two edited versions of a headline is funnier. For labels $y \in \{0, 1, 2\}$, 0 implies the two headlines are equally funny, 1 implies the first is funnier, and 2 implies the second is funnier. Examples of the two subtasks are in Table 1.
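To make the Subtask 2 label convention concrete, the following minimal sketch maps a pair of funniness scores to a label in {0, 1, 2}. The function name `pair_label` is hypothetical, and treating a tie as exact equality of the two scores is an assumption; the dataset may define ties differently.

```python
def pair_label(score_1, score_2):
    """Map two funniness scores (each in [0, 3]) to a Subtask 2 label.

    Assumption: a tie means the two scores are exactly equal.
    """
    if score_1 == score_2:
        return 0  # equally funny
    return 1 if score_1 > score_2 else 2  # 1: first is funnier, 2: second


# Illustrative scores only, not taken from the dataset.
for pair in [(2.8, 0.4), (0.6, 1.2), (1.0, 1.0)]:
    print(pair, "->", pair_label(*pair))
```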

Methods
What are the important characteristics of humor? The incongruity theory, a dominant theory of humor, states that "it is the perception of something incongruous-something that violates our mental patterns and expectations" (Morreall, 2016). Therefore, we hypothesize an edited headline is funny if the edit words are semantically distant from the context words or the original words. This can be exemplified by the first two examples in Table 1. We start by looking at Headline 33210 (the MONKEY EXAMPLE), also shown below. The context sentence is extracted from the edit sentence by replacing the edit words with a single [MASK] token, motivated by masked language models. Humans are likely to predict a place or a character for the masked token, while the edit token is "monkeys". Similarly for Headline 1664, given the context WHAT IF [MASK] HAD AS MUCH INFLUENCE AS ECONOMISTS, humans might fill in occupation-related words like "scientist" or "sociologist" (as in the original headline). However, the edit word is "donkeys", which is a surprising prediction and is considered very funny (scored 2.8 out of 3). In this section, we describe a concrete architecture that models the strength of contrast and surprise, which translates into the funniness score.
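The construction of the context sentence can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the helper name `make_context` is hypothetical, tokens are split on whitespace for simplicity, and the span is taken to be a 0-indexed, half-open range.

```python
def make_context(tokens, span, mask_token="[MASK]"):
    """Replace the edit span with a single mask token.

    `span` is a (start, end) pair, 0-indexed and half-open (an assumed
    convention), so a multi-token edit still yields exactly one mask.
    """
    start, end = span
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty range inside the sentence")
    return tokens[:start] + [mask_token] + tokens[end:]


# Headline 1664, with "donkeys" as the edit word at position 2.
edit_tokens = "What if donkeys had as much influence as economists".split()
context_tokens = make_context(edit_tokens, (2, 3))
print(" ".join(context_tokens))
```

Collapsing a multi-token edit into one mask keeps the context sentence independent of the length of the edit.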

Span Representation
$\dots, x_{T_c})$ denote the original, edit, and context token sequences. A pretrained word embedding or pretrained encoder maps the tokens into vector sequences $e'_{1:T_o}$, $\tilde{e}_{1:T_e}$, $e_{1:T_c}$. The goal is to encode the edit sentence, original sentence, and context sentence into fixed-length vector representations $u, v', v \in \mathbb{R}^d$. Importantly, we use span (a.k.a. sub-sentence) representations rather than whole sentences, which correspond to the underlined ranges in the above MONKEY EXAMPLE. Denote a span as a tuple of the start and end positions of contiguous tokens.
Figure 1: Transformer architecture to predict the funniness score of an edited headline using the edit-context sentence pair. The full edited sentence is "What if donkeys had as much influence as economists". "donkeys" is the edited word and is tokenized into two subwords in this example.
We max pool all context words to extract the most salient features. The context vector is $v = \mathrm{MaxPool}(e_1, \dots, e_{i-1}, e_{i+1}, \dots, e_{T_c})$.
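As a small illustration of this pooling step, the sketch below takes the element-wise maximum over all context embeddings except the one at the masked position. It uses plain Python lists instead of tensors, the name `max_pool_context` is hypothetical, and positions are 0-indexed, whereas the formula above is 1-indexed.

```python
def max_pool_context(embeddings, i):
    """Element-wise max over all token vectors except position i.

    `embeddings` is a list of equal-length vectors, one per context token;
    `i` is the 0-indexed position of the masked token.
    """
    others = embeddings[:i] + embeddings[i + 1:]
    if not others:
        raise ValueError("need at least one context token besides the mask")
    # zip(*others) iterates over dimensions, so each max is per feature.
    return [max(column) for column in zip(*others)]


# Toy 2-dimensional embeddings; the vector at position 1 is skipped.
toy = [[0.1, -0.5], [9.0, 9.0], [0.4, -0.2]]
v = max_pool_context(toy, 1)
print(v)
```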
Transformer We use pretrained transformer-based language models to obtain contextual word representations. The architecture, shown in Figure 1, is a Siamese network (Bromley et al., 1994) in which the two encoders have identical structures and shared parameters. With the self-attention mechanism, each word attends to all other words in the sentence and aggregates contextual information. The edit and original vectors are obtained by averaging the contextual embeddings of the tokens in their respective spans. The context vector is simply the contextual embedding of the masked token: $v = e_i$. We experiment with BERT and RoBERTa, using the PyTorch (Paszke et al., 2019) implementation from the HuggingFace Transformers library (Wolf et al., 2019). We use bert-base-uncased ($L = 12$, $d = 768$, lower-cased) and roberta-base ($L = 12$, $d = 768$).
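The two pooling rules for the transformer case can be sketched without any deep learning library. The code below assumes the encoder outputs are already available as lists of vectors; `mean_pool_span` and `masked_token_vector` are hypothetical names, and the span is a 0-indexed, half-open range by assumption.

```python
def mean_pool_span(embeddings, span):
    """Average the contextual vectors of the tokens inside the span."""
    start, end = span
    rows = embeddings[start:end]
    if not rows:
        raise ValueError("span selects no tokens")
    return [sum(column) / len(rows) for column in zip(*rows)]


def masked_token_vector(embeddings, i):
    """The context vector is the contextual embedding at the mask position."""
    return embeddings[i]


# Toy encoder outputs for four subword tokens with d = 2. An edit word that
# is split into two subwords occupies positions 1 and 2.
edit_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
u = mean_pool_span(edit_outputs, (1, 3))

# Toy outputs for the context sentence, where position 1 holds the mask.
context_outputs = [[0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]
v = masked_token_vector(context_outputs, 1)
print(u, v)
```

Averaging over subwords gives a fixed-length vector no matter how the tokenizer splits the edit word.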
Transfer Paradigm When using those pretrained word representations, we consider two transfer paradigms: finetuning (FINETUNE) and not finetuning (FREEZE). In the case of FREEZE, we use fixed word embeddings directly as the features for CBOW. For transformers, we use a weighted average of hidden layers from the frozen encoder, with trainable mixing scalars. This approach follows ELMo (Peters et al., 2018) and the edge probing model (Tenney et al., 2019). Specifically, the final aggregated embedding for the $i$-th position is $e_i = \gamma \sum_{l=0}^{L} \alpha_l e_{l,i}$, where $\gamma$ is a scaling factor, $\alpha_l$ is the weight of the $l$-th layer, $L$ is the total number of layers, and $l = 0$ corresponds to the embedding layer.
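A minimal sketch of the layer mixing $e_i = \gamma \sum_{l=0}^{L} \alpha_l e_{l,i}$ for a single position is given below. As an assumption borrowed from ELMo, the weights $\alpha_l$ are obtained by a softmax over unconstrained scalars; the text above does not state how they are normalized. The names `softmax` and `scalar_mix` are hypothetical, and in a real model the raw weights and `gamma` would be trainable parameters.

```python
import math


def softmax(raw_weights):
    """Numerically stable softmax over a list of scalars."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Weighted sum of per-layer vectors at one token position.

    `layer_vectors[l]` is the d-dimensional vector of layer l, with l = 0
    being the embedding layer, so there are L + 1 entries in total.
    """
    if len(layer_vectors) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    alphas = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(a * vec[k] for a, vec in zip(alphas, layer_vectors))
        for k in range(dim)
    ]


# Two toy layers with equal raw weights, so each alpha is 0.5.
mixed = scalar_mix([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0], gamma=2.0)
print(mixed)
```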

Task Specific
Regression As mentioned at the beginning of the section, contrast and surprise are the key to humor. To represent the pairwise relationship between two vectors, we derive the feature $h = f(x, y) = [x; y; |x - y|; x * y] \in \mathbb{R}^{4d}$, where $;$ denotes concatenation and $*$ denotes element-wise multiplication. This feature has been used as the input to the classifier in the sentence pair tasks of SentEval (Conneau and Kiela, 2018). To formulate the contrast pair, we use either the edit sentence and its context, $f(u, v)$, or the edit sentence and the original sentence, $f(u, v')$. We denote the two scenarios as CONTEXT and ORIGINAL respectively. Finally, we use a classifier to predict the funniness score of the edited headline: $\hat{z} = \mathrm{Classifier}(h) \in \mathbb{R}$. The classifier is a two-layer MLP with 256 hidden dimensions. When finetuning transformers we use a single-layer linear projection instead, since the encoder's large number of parameters already provides sufficient flexibility. The optimization objective is mean squared error, $\mathcal{L} = \lVert z - \hat{z} \rVert^2$.
Classification In Subtask 2, we use the same method to predict the scores of the two edited versions, $\hat{z}^{(1)}$ and $\hat{z}^{(2)}$. By comparing the scores, the funnier version is found at evaluation and test time: $\hat{y} = \arg\max_{i \in \{1, 2\}} \hat{z}^{(i)}$. The loss function is $\mathcal{L} = \lVert z^{(1)} - \hat{z}^{(1)} \rVert^2 + \lVert z^{(2)} - \hat{z}^{(2)} \rVert^2$.
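The pair feature and the Subtask 2 decision rule can be illustrated with plain Python lists. This is a sketch, not the submitted system: the classifier itself is omitted, the function names are hypothetical, and breaking ties in favor of the first headline is an assumption that mirrors the usual argmax convention.

```python
def pair_features(x, y):
    """Build [x; y; |x - y|; x * y], a vector of length 4d."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(x, y)]
    product = [a * b for a, b in zip(x, y)]
    return list(x) + list(y) + abs_diff + product


def funnier_index(z_hat_1, z_hat_2):
    """Return 1 or 2 for the headline with the larger predicted score.

    Assumption: a tie resolves to the first headline.
    """
    return 1 if z_hat_1 >= z_hat_2 else 2


def pair_loss(true_scores, predicted_scores):
    """Sum of squared errors over the two headlines of a pair."""
    return sum((z - z_hat) ** 2 for z, z_hat in zip(true_scores, predicted_scores))


h = pair_features([1.0, 2.0], [3.0, -1.0])
print(len(h), h)
print(funnier_index(1.2, 0.7), pair_loss([1.0, 2.0], [0.5, 2.5]))
```

The absolute difference and the element-wise product are both symmetric in the two inputs, while the concatenation keeps their order, so the classifier can still tell the edit vector from its counterpart.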

Metrics
For Subtask 1, the primary metric for official ranking is Root Mean Squared Error (RMSE). In addition, we calculate Spearman's rank correlation coefficient, which measures the monotonic relationship between predicted scores and true scores. In the evaluation of Subtask 2, instances with label 0 are ignored. The primary metric for official ranking is accuracy. As an auxiliary metric, reward takes pairwise score differences into account: $\mathrm{Reward} = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{1}[\hat{y}_i = y_i] - \mathbb{1}[\hat{y}_i \neq y_i] \right) |z_i^{(1)} - z_i^{(2)}|$, where $y_i$ and $\hat{y}_i$ are the true and predicted labels respectively, and $z_i^{(1)}$ and $z_i^{(2)}$ are the true scores.
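The metrics can be sketched in a few lines. The reward below assumes one reading of the auxiliary metric: each correct prediction earns the absolute gap between the two true scores, each wrong prediction loses it, and the result is averaged over the instances whose label is not 0. The function names and the toy inputs are hypothetical.

```python
import math


def rmse(true_scores, predicted_scores):
    """Root mean squared error for Subtask 1."""
    squared = [(t - p) ** 2 for t, p in zip(true_scores, predicted_scores)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy_and_reward(labels, predictions, scores_1, scores_2):
    """Subtask 2 metrics, ignoring instances whose true label is 0."""
    kept = [
        (y == y_hat, abs(z1 - z2))
        for y, y_hat, z1, z2 in zip(labels, predictions, scores_1, scores_2)
        if y != 0
    ]
    if not kept:
        raise ValueError("no instances with a non-zero label")
    accuracy = sum(correct for correct, _ in kept) / len(kept)
    # Assumed reward: +gap when correct, -gap when wrong, averaged.
    reward = sum(gap if correct else -gap for correct, gap in kept) / len(kept)
    return accuracy, reward


# Toy inputs; the third pair is a tie (label 0) and is skipped.
acc, rew = accuracy_and_reward(
    [1, 2, 0, 1], [1, 1, 2, 1], [2.0, 0.5, 1.0, 1.5], [1.0, 1.5, 1.0, 0.5]
)
print(acc, rew, rmse([0.0, 2.0], [1.0, 1.0]))
```

Under this reading, a mistake on a pair with nearly equal scores costs little, while a mistake on a pair with a large gap is penalized heavily.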

Official Evaluation
For the official evaluation, our submitted system is RoBERTa-FREEZE-CONTEXT. We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e-3 and use the 10th epoch. Our system gets 0.531 RMSE for Subtask 1 (11th among 49 submissions) and 0.632 accuracy for Subtask 2 (9th among 32 submissions) on the test set.
Table 3: Comparison between non-contrastive and contrastive approaches, based on RoBERTa. CONTEXT and ORIGINAL are contrastive, using the edit-context pair and the edit-original pair as input respectively. EDIT is non-contrastive, using the edit sentence only. The results are from test sets of the two subtasks. We tune other hyperparameters on the validation sets and select the best model for each cell.

Post-Evaluation
In the post-evaluation phase, we conduct a more extensive search on hyperparameters and select the best models based on validation performance. Experiment details and hyperparameters are in Appendix B. We systematically compare CBOW, BERT, and RoBERTa, and perform an ablation study to understand the effects of various factors: extra training data, finetuning or freezing the pretrained embeddings, and using CONTEXT or ORIGINAL feature. The post-evaluation results on the test set are in Table 2.
Contextual Representation Despite its simplicity, CBOW is surprisingly effective. Its best result is significantly better than the baseline and is comparable to Subtask 1 #19 (0.547) and Subtask 2 #17 (0.605) on the leaderboard. Comparing the three models, we see that pretrained language models perform better than context-independent word embeddings. While results for BERT and RoBERTa are similar, both outperform CBOW, which suggests that contextual information is essential for humor detection.
CONTEXT vs. ORIGINAL In the ablation study, we first notice that neither finetuning nor using extra data from the FunLines dataset makes much difference for any of the models. Interestingly, using different contrast pairs as the feature affects the models differently. CONTEXT is better than ORIGINAL for CBOW, yet the two are similar for pretrained language models. Why does this happen? CBOW with ORIGINAL only uses the information of the edited word and the original word while completely neglecting the contextual relation. Pairing CBOW with CONTEXT can alleviate this limitation. On the other hand, pretrained language models exploit the contextual relation between edit words and context words in both cases.

Analysis and Discussion

Non-contrastive Approach
In our main experiment, we focus on the contrastive approach using a sentence pair (i.e., CONTEXT and ORIGINAL) and show its effectiveness. The remaining question is whether we can predict humor using the edited sentence as the only input. We therefore investigate a non-contrastive approach, with a single encoder to obtain the span representation of the edit sentence. We refer to this as EDIT. This is equivalent to using only the left part of Figure 1. We conduct an experiment with RoBERTa on both subtasks. The results are in Table 3. We see that EDIT performs similarly to the contrastive approaches. We conjecture that EDIT captures contrast implicitly, while CONTEXT and ORIGINAL capture contrast explicitly by design.

Error Analysis
To understand the relationship between human judgment and model predictions, we calculate the correlation matrix between true funniness scores and predicted scores from RoBERTa with different features (CONTEXT, ORIGINAL, and EDIT) for Subtask 1. From Table 4, we see that the models correlate poorly with human judgment (correlation ≈ 0.4), while correlating well with each other (correlation ≈ 0.8).
To further learn when the models make erroneous judgments, we look at model predictions on the test set of Subtask 1. We see that the models generally capture the incongruity phenomenon. While incongruity is key to many examples, it does not account for others. We summarize some typical examples in Table 5:
• For Headline 9100, the edit word "children" is incongruous with the context of national security. While humans consider it not funny at all, models assign a high funniness score. The fallacy is that incongruity is not a sufficient condition for humor.
• In other cases, the edit words are congruous with the context. While humans consider them very funny, models predict the opposite. That is, incongruity is not a necessary condition for humor. Humor has diverse underlying causes. For instance, Headline 12685 shows sarcasm, taunting Trump's lack of geography knowledge and common sense. Headline 12271 uses a pun based on polysemy: "turkey" can mean either a country (when capitalized) or a bird. Also, humor can require an understanding of cultural commentary, exemplified by Headline 9406. Since the Cheesecake Factory is a large chain of restaurants that some may look down upon, readers are happy to see it blown up with "no complaints".
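The correlation matrix described at the start of this subsection can be computed along the following lines. The sketch assumes Pearson correlation, which the text does not specify, and the score lists are made-up toy values rather than results from the paper. The names `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Pairwise correlations between named score lists, as a nested dict."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Toy values for illustration only.
toy_scores = {
    "human": [0.2, 1.4, 2.6, 0.8],
    "CONTEXT": [0.9, 1.0, 1.3, 1.1],
    "EDIT": [0.8, 1.1, 1.2, 1.2],
}
matrix = correlation_matrix(toy_scores)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```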

Conclusions
We use incongruity as the key to assessing funniness in edited news headlines. Specifically, we use pretrained transformer-based language models to encode contrastive pairs. Our best performing model is RoBERTa, which is submitted for the official evaluation and achieves competitive performance in both subtasks. The additional experiment shows that a non-contrastive approach may also encode incongruity implicitly. While incongruity is a common ingredient of humor, error analysis indicates it is neither sufficient nor necessary. This invites future research to take other factors (e.g., sarcasm, pun, or world knowledge) into account to better tackle humor, an intricate phenomenon rooted in human creativity.

B Experiment Details
Preprocessing We use the spaCy word tokenizer for CBOW. The pretrained transformers use byte-pair encoding (Sennrich et al., 2016, BPE) to convert text into subword units. BERT uses WordPiece (Wu et al., 2016) tokenization, a character-level BPE, with a vocabulary size of 30K. RoBERTa preserves case and uses a byte-level BPE with a vocabulary size of 50K.
Training For training, we use a batch size of 32 in Subtask 1 and 16 in Subtask 2. We use the Adam optimizer and perform gradient clipping with a maximum $\ell_2$ norm of 5. For most experiments, we train for 10 epochs with a learning rate in {1e-3, 3e-4}. However, when finetuning transformers, we choose the maximum number of epochs in {3, 10}, and use either a constant learning rate or a linearly decreasing schedule with an initial learning rate in {2e-5, 5e-5}. We perform validation on the development set every 1/3 epoch and save the best checkpoint.
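Two of the training details, gradient clipping and the linearly decreasing schedule, can be sketched in plain Python. This is an illustration under assumptions: the gradient is a flat list of numbers, the schedule decays to zero with no warmup (the text does not say whether warmup is used), and both function names are hypothetical.

```python
import math


def clip_by_l2_norm(gradient, max_norm=5.0):
    """Rescale the gradient so that its l2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


def linear_decay_lr(step, total_steps, initial_lr=2e-5):
    """Learning rate that falls linearly from initial_lr to 0.

    Assumption: no warmup phase.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining


print(clip_by_l2_norm([6.0, 8.0]))  # norm 10 is rescaled to norm 5
print([linear_decay_lr(s, 100) for s in (0, 50, 100)])
```

Clipping rescales the whole gradient rather than each coordinate, so its direction is preserved.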