ExPUNations: Augmenting Puns with Keywords and Explanations

The tasks of humor understanding and generation are challenging and subjective even for humans, requiring commonsense and real-world knowledge to master. Puns, in particular, add the challenge of fusing that knowledge with the ability to interpret lexical-semantic ambiguity. In this paper, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns with detailed crowdsourced annotations of keywords denoting the most distinctive words that make the text funny, pun explanations describing why the text is funny, and fine-grained funniness ratings. This is the first humor dataset with such extensive and fine-grained annotations specifically for puns. Based on these annotations, we propose two tasks: explanation generation to aid with pun classification and keyword-conditioned pun generation, to challenge the current state-of-the-art natural language understanding and generation models’ ability to understand and generate humor. We showcase that the annotated keywords we collect are helpful for generating better novel humorous texts in human evaluation, and that our natural language explanations can be leveraged to improve both the accuracy and robustness of humor classifiers.


Introduction
Humor serves multiple purposes and provides numerous benefits, such as relieving anxiety, avoiding painful feelings and facilitating learning (Buxman, 2008). As a specific example of humor, the creative uses of puns, wordplay and ambiguity are important ways to come up with jokes (Chiaro, 2006). Pun understanding and generation are particularly challenging tasks because they require extensive commonsense and world knowledge to compose and understand, even for humans. Despite growing

Text: When artists dream in color it's a pigment of their imagination.
NLEx: Pigments are non-soluble materials often used in painting, and pigment sounds like figment, which is something that is not real but someone believes it is.

Text: The man found something to catch fish, which was a net gain.
KWD: catch fish, net gain
NLEx: This is a play on words. A "net gain" means an increase in revenue but here "net" refers to how a net is used to catch fish.

Table 1: Two examples of annotated Keywords (KWD) and Natural Language Explanations (NLEx) for puns in our dataset. The highlighted texts are annotated keywords that contribute to making the text funny.
interest in the area, there are limited amounts of data available in the domain of humor understanding and generation.
Existing humor datasets are usually only annotated with binary labels indicating whether each sentence is a joke, pun, or punchline (Hasan et al., 2019; Weller and Seppi, 2019; Castro et al., 2018; Mittal et al., 2021). This is insufficient to benchmark models' ability to understand and generate novel humorous text, since hardly anything meaningful can be learned from such a sparse supervision signal and coarse-grained annotation.
To facilitate research on humor understanding and generation, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns from SemEval 2017 Task 7 (Miller et al., 2017) with detailed crowdsourced annotations of fine-grained funniness ratings on a Likert scale of one to five, along with keywords denoting the most distinctive words that make the text funny and natural language explanations describing why the text is funny (Table 1). In addition, we collect annotations indicating whether a person understands the sentence, thinks it is a pun, and finds the joke offensive or inappropriate (Table 2).

Table 2: Two examples with annotation fields that we collect. We use underline to mark the commonsense knowledge that people need in order to understand the joke.

Since these tasks are all highly subjective, we collect multiple annotations per sample, and present a detailed agreement analysis. We believe our annotations can be used in many other applications beyond pun understanding and generation, such as toxicity detection.
The contributions of our work are threefold:
• We contribute extensive high-quality annotations for an existing humor dataset along multiple dimensions.
• Based on the annotations, we propose two tasks, explanation generation for pun classification and keyword-conditioned pun generation, to advance research on humor understanding and generation.
• We benchmark state-of-the-art NLP models on explanation generation for pun classification and keyword-conditioned pun generation.
Our experiments demonstrate the benefits of utilizing natural language keywords and explanations for humor understanding and generation while highlighting several potential areas of improvement for the existing models.

ExPUN Dataset
In this section, we describe our data annotation procedure, including details of the annotation fields and our assessment of the annotation quality.
We sample 1,999 text samples from SemEval 2017 Task 7 as the basis for our humor annotation. The dataset also contains examples of non-pun text.

Dataset Annotation
The annotated fields (AF) are collected in the following order:
AF 1 [understandability]: whether the annotator understands the text or not, regardless of whether they perceive it as funny.
AF 2 [offensiveness]: whether the annotator finds the text offensive or inappropriate.
AF 3 [joke]: whether the annotator thinks the text is intended to be a joke.
AF 4 [funniness]: rate the funniness on a Likert scale of 1-5, where 1 means very not funny and 5 means very funny.
AF 5 [explanation]: explain in concise natural language why this joke is funny. More specifically, if external or commonsense knowledge is required to understand the joke and/or its humor, the annotator should include the relevant knowledge in the explanation. If the joke is a pun or play on words, they must provide an explanation of how the play on words works.
AF 6 [joke keywords]: pick out (as few as possible) keyword phrases from the joke that are related to the punchline/the reason the joke is funny.We emphasize that phrases should be sparse and mainly limited to content words, can be multiple words long, and the keywords should be copied verbatim from the joke.
If an annotator rates the instance as not understandable, they will skip the rest of the annotation for that instance (AF 2 - AF 6 ). In addition, if an annotator rates an example as not a joke, they can skip the rest of the annotation (AF 4 - AF 6 ). Table 2 shows two examples in our dataset. The first example has two annotators who think the text is a joke, and therefore it has two explanations. In the second instance, all annotators unanimously agree it is a joke. Here, we sample two explanations from the original five. For both instances, we use underline to highlight the external commonsense knowledge in the explanation. If the joke is a play on words, the explanation also shows how the play on words works (e.g., the second joke). We show the full annotation guidelines, including calibrating examples, in Appendix A.
We crowdsourced 5 annotations per sample using a professional team of 10 dedicated full-time annotators within our organization. Before starting the task, we held a kick-off meeting with the team to explain the annotation guidelines in detail. We then conducted 3 pilot rounds for calibration and iteratively met with annotators, including more details and examples to address annotator questions. Finally, we conducted 7 rounds of annotation, each with 100-300 puns grouped into minibatches of 50 examples. Each sample in a minibatch was annotated by consistent subteams of 5 annotators. After receiving a completed batch of annotations, we manually examined their quality and provided feedback on any quality issues, redoing batches as necessary.

Dataset Statistics and Quality Control
We report overall dataset statistics in Table 3. For AF 1 - AF 3 , we count the number of samples labeled positive by majority vote. For AF 4 , we compute the average of all funniness scores, excluding blank annotations, and find that while annotators recognized most samples as jokes, they did not find them to be particularly funny. For AF 5 and AF 6 , we compute lexical statistics of our explanations and keyword annotations and provide deeper analysis of these key annotation fields in Section 2.4. We report inter-annotator agreement for all annotation fields in Table 4. For fields AF 1 - AF 4 , we compute agreement using (1) the average of Cohen's kappa scores of each annotator against the majority vote, and (2) the average Spearman correlation between each pair of annotators. We find that annotators show moderate agreement when deciding if the given text is a joke (AF 3 ), but lower agreement on the task of understanding the text (AF 1 ) as well as the much more subjective task of rating how funny a joke is (AF 4 ). We also find weak average Spearman correlation between each pair of annotations for the subjective categories of offensiveness (AF 2 ), whether the text is a joke (AF 3 ) and joke funniness (AF 4 ).
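The first agreement measure can be sketched in a few lines of pure Python (an illustration with our own helper names, not the paper's code; ties in the majority vote are broken arbitrarily by label count order):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from each rater's marginal label distribution
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def majority_vote_agreement(annotations):
    """Average each annotator's kappa against the per-sample majority vote.

    `annotations` is a list of per-annotator label lists (same length).
    """
    majority = [Counter(col).most_common(1)[0][0] for col in zip(*annotations)]
    return sum(cohens_kappa(a, majority) for a in annotations) / len(annotations)
```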
For the free text fields in AF 5 and AF 6 , we compute averaged BLEU-4 (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores in a pairwise fashion. We treat each annotator's explanation (for AF 5 ) or list of keyword phrases joined into a string (for AF 6 ) as candidate text, with the remaining annotators' annotations as a set of references. We find high similarity between joke keyword annotations, suggesting that annotators identify similar spans of keyword phrases, and a lower degree of similarity between pun explanations.
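The leave-one-out pairwise scheme looks roughly like the following; to keep the sketch dependency-free we plug in a simple token-level F1 as a stand-in for BLEU-4/METEOR (the stand-in metric and function names are our assumptions, not the paper's implementation):

```python
from collections import Counter

def token_f1(candidate, reference):
    """Token-overlap F1 between two strings (multiset overlap)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def pairwise_similarity(texts, score=token_f1):
    """Treat each annotator's text as candidate against the others as
    references (best match over references), then average over annotators."""
    scores = []
    for i, cand in enumerate(texts):
        refs = [t for j, t in enumerate(texts) if j != i]
        scores.append(max(score(cand, r) for r in refs))
    return sum(scores) / len(scores)
```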

Dataset Analysis
Explanations. As seen in Figures 1a and 1b, on average, samples are annotated with multiple explanations, and the explanations are lengthy, spanning multiple sentences, and lexically diverse (14,748-token vocabulary, with 210,580 tokens overall). Figure 3 in Appendix B shows the distribution of the top 50 most frequent content words in our explanations. The frequent use of usually and often indicates that explanations often spell out commonsense knowledge, e.g., thunder and lightning are usually present in a weather storm or "pain" means physical discomfort often felt by a hospital patient.
The most frequent words, means and word, indicate that annotators frequently provide word sense information as part of their explanations, while sounds frequently appears in explanations of heterographic puns. Each of these most frequent words comprises less than 2.8% of all tokens in the explanations, illustrating the rich diversity of our corpus.

Keywords. As seen in Figures 1c and 1d, on average, keyword phrases in ExPUN, which are derived from the original puns, are short and sparse (5,497-token vocabulary, with 27,820 tokens overall). This follows from our guidelines to annotate keywords concisely, focusing mainly on content words that are essential to understanding the joke. Table 5 shows two examples of pun keyword annotations in our dataset that showcase different annotation styles among annotators. For instance, one annotator may tend to select wordy keyword phrases that introduce unnecessary tokens, while another may omit salient keywords that other annotators mention. Aggregating these annotations into a single ground-truth set of keyword phrases is therefore challenging because of differing annotation styles. The problem of merging keywords is further complicated because keywords from different annotators are often not well aligned, as different annotators may annotate varying numbers of keyword phrases and different spans. Taking these considerations into account, we propose a keyword aggregation algorithm to address these issues and construct a single set of aggregated keywords per sample.
Keyword Aggregation. Algorithm 1 in Appendix C describes our keyword aggregation method. The algorithm aims to generate a comprehensive list of concise keywords for each sample. First, we compute a reliability score for each annotation, defined as the number of keyword phrases minus the average number of tokens per keyword phrase. The higher the score, the more comprehensive and concise an annotator's keywords should be.
We choose the annotator with the highest score to be the anchor. We note, however, that keyword annotations are not always error-free; e.g., in the first example of Table 5, w 4 has an incorrect word (fancy chairs instead of royal chairs). Therefore, for each keyword phrase, we compute the fuzzy matching score between the anchor's annotation and the rest of the annotators' annotations. For each annotator, we keep the keyword phrase that has the highest fuzzy matching score with the anchor annotator's, with a minimum threshold score of 60. This process produces a filtered keyword list where each of the remaining keyword phrases looks similar to the anchor's. Then, we compute the average fuzzy matching score between the anchor's keyword phrase and each element in the filtered keyword list. We then choose the annotator with the second-highest reliability score to be the anchor, and repeat the above process. Finally, by choosing the resulting keyword phrases that attain the maximum average fuzzy matching score between the first and second anchors, we obtain the final aggregated keywords for this instance.
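A simplified, single-pass version of this aggregation can be sketched as follows. We use difflib's similarity ratio scaled to 0-100 as a stand-in for the fuzzy matching score, and we keep an anchor phrase whenever at least one other annotator approximately matches it, so this illustrates the idea rather than reproducing Algorithm 1 exactly:

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Stand-in fuzzy-matching score in [0, 100] (difflib-based)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def reliability(keywords):
    """More phrases and shorter phrases -> higher reliability."""
    avg_tokens = sum(len(k.split()) for k in keywords) / len(keywords)
    return len(keywords) - avg_tokens

def aggregate_keywords(annotations, threshold=60):
    """Single-anchor sketch: keep the most reliable annotator's phrases
    that at least one other annotator approximately corroborates.

    `annotations` is a list of keyword-phrase lists, one per annotator.
    """
    anchor = max(annotations, key=reliability)
    aggregated = []
    for phrase in anchor:
        support = sum(
            1 for other in annotations
            if other is not anchor
            and max(fuzzy(phrase, k) for k in other) >= threshold
        )
        if support > 0:  # corroborated by another annotator
            aggregated.append(phrase)
    return aggregated
```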

Experiments
With the collected annotations, we propose two new tasks, pun explanation and keyword-conditioned pun generation, to showcase novel tasks that our dataset uniquely enables and to push the frontiers of NLU and NLG for humor. Note that the rich annotations in ExPUN can also enable many other interesting tasks, such as pun keyword extraction, fine-grained funniness prediction, and others. However, we prioritize NLG tasks as they are relatively under-explored compared to NLU tasks. In this section, we benchmark current state-of-the-art models' performance on the proposed tasks.

Pun Explanation
The task of pun explanation takes a pun sentence as input and outputs a natural language explanation of why the pun is funny. This requires extensive understanding of background and commonsense knowledge. We hypothesize that existing NLP models would struggle to generate high-quality explanations for puns. On the other hand, high-quality explanations can improve humor understanding, and thus help tasks such as humor classification. Formally, given text T , our target is to generate an explanation E T of why T is funny. Additionally, we use the explanations to support the task of pun classification, where, given T (and optionally an explanation E T ), we output whether T is a joke.
Data Preparation. For each data sample, we use the longest human-written explanation from ExPUN (AF 5 ), substituting in the pun text if no explanations exist. For pun classification, we assign output labels using the majority vote of AF 3 (is a joke). For both tasks, we split our dataset into 1,699/100/200 for train/dev/test. Dev and test contain an equal distribution of jokes to non-jokes, while training contains 1,299 jokes and 400 non-jokes.
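The explanation-selection step can be sketched as a small helper (function name is ours, not from the paper's code):

```python
def pick_target_explanation(explanations, pun_text):
    """Choose the longest human-written explanation for a sample;
    fall back to the pun text itself if no explanations exist."""
    non_empty = [e for e in explanations if e and e.strip()]
    return max(non_empty, key=len) if non_empty else pun_text
```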
Evaluation Metrics. We do not report lexical overlap metrics as our primary evaluation metric for generated explanations because these are not suited for measuring plausibility (Camburu et al., 2018; Kayser et al., 2021; Clinciu et al., 2021) or faithfulness of explanations (Jacovi and Goldberg, 2020). Rather, we follow prior work and use the "simulatability score" metric from Wiegreffe et al. (2021) to measure explanation quality from the lens of usability of the explanation. It reflects the utility of explanations by measuring the improvement in task performance when explanations are provided as additional input vs. when they are not: acc(IE → O) − acc(I → O), where I denotes the input text, E is the explanation and O is the classification of whether I is a joke. We evaluate how useful explanations can be by measuring the performance increase of acc(IE → O) as we increase the ratio of samples with explanations in the training data, and report acc(I → O) as a constant baseline that uses no explanations.
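In code, the simulatability score reduces to a difference of accuracies (a minimal sketch with our own variable names):

```python
def accuracy(predictions, labels):
    """Fraction of predictions matching the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def simulatability(preds_with_expl, preds_without_expl, labels):
    """Simulatability score: acc(IE -> O) - acc(I -> O).
    Positive values mean explanations helped the classifier."""
    return accuracy(preds_with_expl, labels) - accuracy(preds_without_expl, labels)
```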

Models. We use the following model variations:
No explanations. As a baseline, we finetune BERT-base (Devlin et al., 2019), RoBERTa-base (Liu et al., 2019) and DeBERTa-base (He et al., 2021) to classify whether the given text is a joke without any explanations in the input.
Gold explanations. To find the upper bound of how useful explanations can be, we augment the input to the above baseline models with gold human-annotated explanations in both training and testing. The majority of non-punny examples (labeled as non-jokes by majority vote) do not contain any annotated explanations. In these cases, we experiment with (1) representing the missing explanation as an empty string ("w/ gold expl."), or (2) randomly sampling a negative explanation from another annotated example to use as input ("w/ gold + sampled neg.").
Generated explanations. Following previous work on explanation generation (Wiegreffe et al., 2021), we first finetune a T5 (Raffel et al., 2020) model to generate pun explanations given pun sentences as input. For text that contains no annotated explanations, we use the pun sentence itself as the output explanation. We then use gold human-annotated explanations to train and T5-generated explanations to test the explanation-augmented classification models.
ELV (Zhou et al., 2020a). ELV is a probabilistic framework for text classification in which natural language Explanations are treated as Latent Variables. Two modules, an explanation generation module and an explanation-augmented prediction module, are jointly trained using a variational EM framework. As another baseline, we train an ELV model for pun classification using the ExPUN dataset.
Results. We show our results on the pun classification task in Figure 2. Baseline performance of the no-explanations models is shown using constant dotted lines. Figure 2a shows the upper bound of performance improvement when models are provided with gold explanations, indicating that human-written explanations are useful for this task, and that including more gold explanations in training data generally helps. In particular, adding randomly-sampled negative explanations ("w/ gold + sampled neg.") further improves the classification accuracy, showing the utility of our collected explanations in improving model performance. However, Figure 2b shows that using generated explanations at test time does not help to improve classification accuracy. Using the more carefully-designed ELV framework to jointly train the generation and classification modules shows improvement in classification accuracy (Figure 2c); however, qualitative analysis of the ELV explanations showed that many generated outputs are not fluent natural language, suggesting that performance improvements may stem more from modeling improvements than from the explanations themselves. Given the large improvements we see when incorporating gold explanations at test time, explanations are clearly highly valuable if the quality of generated explanations can be improved.
Table 6 shows examples of T5-generated explanations for given puns. Qualitative analysis shows that generated explanations often identify the relevant pun word, and can include somewhat accurate word sense information for one sense of the pun. However, the model usually fails to explain the alternate word sense and its relation, which is crucial to understanding the wordplay. The model especially fails to explain phonological similarity in heterographic puns; e.g., in the first three examples, explanations fail to mention the alternate words carry, whet and humor. For both pun types, our model can devolve into repetitively copying words from the input. Our results exhibit the challenge of generating good pun explanations and show that high-quality explanations are useful for understanding humor.

Keyword-Conditioned Pun Generation
The task of keyword-conditioned pun generation takes human-annotated pun keywords as input and produces novel puns as output. This benchmarks models' capability to draw connections among words to generate novel fluent, sensible, and humorous texts. This is a challenging task with many downstream applications, such as context-situated humor generation, a task that involves generating humorous text in a given situation or context. In this case, input keywords can come from conversational context (e.g., chatbot dialogues) or narrative context (e.g., creative short stories).

Table 6: Examples of T5-generated explanations for given puns.

Text: My name is Cary. I'm a porter.
Generated Explanation: The joke is a pun on the word "cary". A porter is someone who transports goods.

Text: Fishers often wet their appetite.
Generated Explanation: This is a play on words. The word "wet" means to wet your appetite, which is a characteristic of fish.

Text: A gossip is someone with a great sense of rumor.
Generated Explanation: This is a play on words. The word "rumor" sounds like "rumor". A gossip is someone who has a great sense of rumor.

Text: Oil executives are always using crude language.
Generated Explanation: The joke is a pun on the word "crude". Crude language is used to describe crude oil, which is a type of petroleum product.

Text: Please mix me a martini, said Tom, dryly.
Generated Explanation: This is a play on words. The word "dryly" means dryly, but "dryly" sounds like "dryly".
More formally, we take as input keywords K, the pun word p w and alternate pun word a w , and produce novel and fluent puns that incorporate the keywords. Optionally, we also include pun word sense annotations S pw and S aw from the original SemEval 2017 Task 7 annotations.
Data Preparation. For this task, we limit our data to samples that contain both (1) annotated human keywords K from ExPUN (AF 6 ), and (2) pun word sense annotations S pw and S aw from SemEval 2017 Task 7. There are 1,482 such samples that have both annotations, from which we reserve 100 as test data and use the rest for model training. To construct input human-annotated keywords for this task, we aggregate keywords for each sample using the method described in Section 2.4. Additionally, we evaluate the effect of finetuning on automatically-extracted keywords instead of human-annotated keywords, extracting keywords for each sample by running the RAKE (Rose et al., 2010) algorithm on the pun text.
Evaluation Metrics. We use both automatic metrics and human evaluation to evaluate the quality of generated puns. For automatic evaluation, we calculate word incorporation rates for both pun words and keywords, which measure the model's ability to incorporate all input keywords. Additionally, we run human evaluation using Amazon Mechanical Turk, in which we asked Turkers to label whether or not a given generated pun was successful.

Models. We use the following models:

AmbiPun (Mittal et al., 2022). We use the current state-of-the-art homographic pun generation model, AmbiPun, with no further finetuning. We follow the AmbiPun prompt format: "generate sentence: K, p w , a w ".
Finetuned T5 (T5 FT ). We finetune T5-base on ExPUN using the input prompt "generate a pun that situated in K, using the word p w , p w means S pw , a w means S aw ." The output is the pun itself.

Finetuned T5 with pretraining (T5 PT+FT ). To increase the model's ability to incorporate keywords, we pretrain T5 on non-pun text. For a given pun word, we first extract 200 sentences that contain the pun word from BookCorpus (Zhu et al., 2015), then use RAKE to automatically extract keywords for each sentence. We construct examples where inputs are automatically extracted keywords, and outputs are sentences from BookCorpus including pun words. We pretrain a T5 model with this data before finetuning it on ExPUN.
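For illustration, the T5 FT input prompt can be assembled as below; the template string is quoted verbatim from above (including its phrasing), while the helper function and its comma-joining of multiple keywords are our assumptions:

```python
def build_pun_prompt(keywords, pun_word, alt_word, sense_pun, sense_alt):
    """Fill the finetuned-T5 input template with keywords, pun word,
    alternate word, and the two word-sense glosses."""
    kw = ", ".join(keywords)
    return (
        f"generate a pun that situated in {kw}, using the word {pun_word}, "
        f"{pun_word} means {sense_pun}, {alt_word} means {sense_alt}."
    )
```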
Results. Table 7 shows results of our pun generation models. While the AmbiPun baseline achieves superior word incorporation performance, our T5 PT+FT model finetuned using ExPUN keywords generates successful puns at a higher rate, showing the value of training on our dataset. Furthermore, while pun word incorporation is improved by pretraining on outside sources using RAKE keywords, using automatically-extracted keywords when training on in-domain pun text does not translate to more successful puns. Instead, models finetuned with the more carefully-selected, human-annotated ExPUN keywords generate puns relatively more successfully than their RAKE-trained counterparts. (For human evaluation, Turkers had to pass a qualifier by correctly labeling at least 80% of 20 samples that we manually annotated; success is defined as whether the text supports both senses of the pun word. We measure inter-annotator agreement among 3 annotators using Fleiss' kappa (κ = 0.49), showing moderate agreement. Further experimental details are in Appendix E.)
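For reference, a simple verbatim-substring version of the keyword incorporation rate can be computed as follows (the paper's exact matching scheme may differ; this sketch counts case-insensitive substring hits):

```python
def incorporation_rate(keywords, generated):
    """Fraction of input keyword phrases that appear verbatim
    (case-insensitive) in the generated pun."""
    text = generated.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return hits / len(keywords)
```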
Table 8 shows examples of generated puns from our ExPUN-T5 PT+FT model. The model is able to generate both homographic and heterographic puns somewhat coherently using one of the pun word senses. However, while some puns are successful, Rows 3 and 6 show some ways our model can struggle to generate the respective pun types: it does not always incorporate the alternate word sense in a clever or meaningful way, and can stitch copied input keywords together into incoherent sentences. Our results show pun generation is a very challenging task, and that careful selection of pun keywords and a deeper understanding of humor in wordplay is essential for generating puns successfully.

Related Work
In this work, we contribute annotations for a humor dataset as well as two humor-related generation tasks.The work is broadly related to pun generation, pun detection, explanation generation, and humor generation.We briefly summarize works in these directions.
Pun generation. Many previous works on pun generation have focused on phonological or syntactic patterns rather than semantic patterns (Miller and Gurevych, 2015; Hong and Ong, 2009; Petrović and Matthews, 2013; Valitutti et al., 2013). Our keyword-conditioned pun generation task encourages models to focus more on the linguistic structures via pun keywords, as we observe that human-extracted keywords usually reflect the ambiguity and distinctiveness principles discussed in Kao et al. (2016). The keyword-conditioned pun generation setup can also facilitate more engaging pun generation scenarios such as context-situated pun generation (Sun et al., 2022).
Humor generation. With the recent advent of diverse datasets (Hasan et al., 2019; Mittal et al., 2021; Yang et al., 2021), it has become easier to detect and generate humor. While large pre-trained models have become fairly successful at detection, humor generation still remains an unsolved problem. Therefore, humor generation is usually studied in specific settings. Petrović and Matthews (2013) generate jokes of the type 'I like my X like I like my Y, Z'. Garimella et al. (2020) develop a model to fill blanks in a Mad Libs format to generate humorous sentences, and Yang et al. (2020) edit headlines to make them funny. More research is required to generate humorous sentences that are not constrained by their semantic structure.
Natural language explanation generation. Collecting and utilizing natural language explanations to help various NLP tasks is an emerging topic. The earliest work by Ling et al. (2017) collected natural language justifications, called rationales, to help solve math problems. However, their setup is limited to solving math problems given how their rationales and models were structured. Jansen et al. (2018) composed a dataset of explanation graphs for elementary science questions to support multi-hop inference. Like Ling et al. (2017), they emphasized explanation structures. Several works have introduced large-scale datasets of natural language explanations for the natural language inference (NLI) (Camburu et al., 2018; Kumar and Talukdar, 2020), commonsense reasoning (Rajani et al., 2019), and hate speech detection (Mathew et al., 2021) tasks. However, there are no existing datasets or models that focus on explaining humor, which is a challenging task that involves commonsense and world knowledge.
Pun detection. Being able to detect puns can be an essential step to generating them. SemEval 2017 Task 7 (Miller et al., 2017) introduced the challenge of pun detection, location detection and sense interpretation for homographic and heterographic puns. They also released a dataset which has become the backbone of our work and several other related works. Diao et al. (2019) make use of gated attention networks to detect heterographic puns. Zou and Lu (2019) introduce a tagging scheme to jointly detect and locate puns, and apply this approach to both heterographic and homographic puns. Zhou et al. (2020b) jointly model contextual and phonological features into a self-attentive embedding in their approach for pun detection and location tasks.

Conclusion
In this paper, we contribute a dataset of extensive, high-quality annotations of humor explanations, keywords, and fine-grained funniness ratings. This is the first humor dataset with such extensive and fine-grained annotations. Based on the annotations, we propose two tasks, pun explanation and keyword-conditioned pun generation, to challenge state-of-the-art natural language understanding and generation models' ability to understand and generate humorous text. We benchmark several strong models' performance on the two proposed tasks to validate the practical usage of the proposed annotations, and show that our human-annotated explanations and keywords are beneficial in understanding and generating humor. Future directions include a deeper analysis of how to characterize pun explanation more objectively within our annotation scheme, as well as further exploration of better models for both the pun explanation and pun generation tasks.

Limitations
This work focuses on understanding and generation of puns, a single and very specific form of humorous language.We hope that our annotation schema and methods can be used in the future to extend to other forms of humor, e.g., joke generation.Additionally, we acknowledge that humor is a highly subjective area, i.e., what might be perceived as humorous may differ greatly from one person to another depending on their unique backgrounds and experiences.We hope this work can be used as an initial framework to begin characterizing humor through human-written explanations, such that it can be used more broadly to give insight into what contributes to humorous content for different individuals and groups.
Finally, since we use pretrained language models for our generation tasks, we note that this makes our models susceptible to generating biased or sensitive content.While we do not explicitly address concerns around bias/sensitive content within our framework to date, we aim to incorporate these considerations into pun generation as we develop new models, including methods to filter our inputs and generated data for toxicity and biased references that may be deemed offensive.

A ExPUN Dataset Annotation Details
A.1 Annotation Guidelines

Below, we include the annotation guidelines we used to collect the ExPUN dataset. All pun texts in the provided examples are from the original SemEval 2017 Task 7 dataset (Miller et al., 2017).

Guidelines. You will be provided a CSV file of short texts, one short text to be annotated per row. Each row contains the text content as well as columns for each of the requested annotations. For each row, read the text carefully, and provide the following annotations:

1. Mark whether you understood the text with 0/1 (0 didn't understand, 1 understood the text).

2. Mark whether you find the text offensive or inappropriate with 0/1 (0 not offensive, 1 offensive).

3. Mark whether you think the text is intended to be a joke with 0/1 (0 not a joke, 1 is a joke).
• Text should be labeled as 1 (is a joke) even if it intends to be humorous, but falls flat or is a lame/bad joke.
• Example text labeled 0 (not a joke): All that glistens is not gold.
• Example text labeled 1 (is a joke): These are my parents, said Einstein relatively.
Why is this a joke? Though subtle, the text is a pun on the word "relatively" that associates Einstein with his relatives (parents) and his theory of relativity.
• If you rate this sample as 0 (not a joke), skip the rest of the annotation for this sample.
• Score of 1: A very not funny joke consists of a joke that is not funny at all, or tries to be funny but does not achieve the intended effect.
Example of Funniness 1 (not funny): These are my parents, said Einstein relatively.

• Score of 3: An average joke consists of a joke that is average and may elicit some chuckles (or groans) from you or others.

Example of Funniness 3 (average funniness): When they told him that his drum couldn't be fixed, it didn't resonate very well.

• Score of 5: A very funny joke consists of a good joke that you find humorous and potentially would want to share/tell to others.

Example of Funniness 5 (very funny):
Yesterday I accidentally swallowed some food coloring. The doctor says I'm OK, but I feel like I've dyed a little inside.
5. Explain concisely in natural language why this joke is funny. If external or commonsense knowledge is required to understand the joke and/or its humor, please include the relevant knowledge in your explanation. If the joke is a pun or play on words, you must provide an explanation of how the play on words works.
• Example joke: What do you use to cut a Roman Emperor's hair? Caesars.

Bad explanation:
The joke is a play on words about Caesar and scissors.

Good explanation:
The joke is a play on words: Caesar was a Roman Emperor, and "Caesars" sounds like "scissors", which is something you use to cut hair.

• Example joke: There was a kidnapping at school yesterday. Don't worry, though - he woke up!

Bad explanation:
The joke is a play on words about kidnapping → kid napping.

Good explanation:
The joke is a play on words. The word "kidnapping" implies that a kid was taken hostage at school, but "he woke up" suggests that it was actually just a kid taking a nap instead.
6. Pick out (as few as possible) keyword phrases from the joke that are related to the punchline/the reason the joke is funny (written as a pipe-separated (|) list of phrases, with spaces).
• Phrases can be multiple words long.
• The keyword phrases should be copied verbatim from the joke (no need to reword them).
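The pipe-separated keyword format described above is straightforward to parse. The following is a minimal sketch (the function name is ours, not part of any released dataset tooling):

```python
def parse_keywords(annotation: str) -> list:
    """Split a pipe-separated keyword annotation into phrases.

    Phrases may be multiple words long; whitespace around each '|'
    separator is stripped, and empty entries are dropped.
    """
    return [p.strip() for p in annotation.split("|") if p.strip()]

# Keyword annotation for the food-coloring pun above:
parse_keywords("swallowed | food coloring | dyed a little inside")
# -> ['swallowed', 'food coloring', 'dyed a little inside']
```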

C Keyword Aggregation Algorithm
We propose the keyword aggregation algorithm in Algorithm 1 to merge keyword annotations from different workers.

D Classifier Implementation Details
We finetune pretrained language models to classify whether given text samples are jokes, using the HuggingFace library (Wolf et al., 2020). We choose the checkpoint with the best accuracy on the dev set for inference.
For the ELV model, we use the released code and inherit most of the default hyperparameters for ELV-sa. We change the training batch size per GPU to 4 to accelerate training.
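The checkpoint-selection step described above amounts to an argmax over dev-set accuracies. A minimal sketch (the checkpoint names and accuracy values below are hypothetical):

```python
def best_checkpoint(dev_accuracy: dict) -> str:
    """Pick the checkpoint whose dev-set accuracy is highest."""
    return max(dev_accuracy, key=dev_accuracy.get)

# Hypothetical per-epoch dev accuracies:
best_checkpoint({"epoch-1": 0.81, "epoch-2": 0.86, "epoch-3": 0.84})
# -> 'epoch-2'
```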
Figure 1: Distributions of (a) number of tokens and (b) number of sentences in explanations (AF5), (c) tokens in keyword phrases (AF6), and (d) keyword phrases per sample. Horizontal lines show the min, mean, and max values for each distribution.
(a) Gold explanations during test. (b) Generated explanations during test. (c) ELV model.

Figure 2 :
Figure 2: The impact of using human-written (2a) and model-generated explanations (2b and 2c) vs. no explanations (constant dotted lines) on pun classification accuracy. All reported numbers are computed with a three-seed average. For each data point, we train a model on the full dataset, but only provide explanations for a given percentage, as shown on the x-axis. We use any provided explanations as E, both in training and in testing with gold explanations. Otherwise, to construct training examples that have no annotated explanations, or where explanations are held out, we try two variants: (1) representing the missing explanation as an empty string ("w/ gold expl."), or (2) randomly sampling a negative explanation from another annotated example to use as input ("w/ gold + sampled neg.").
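The two input-construction variants in the caption above can be sketched as follows. The function name, the "[SEP]" separator, and the exact input layout are our assumptions for illustration; the paper does not specify them here.

```python
import random
from typing import List, Optional

def build_classifier_input(text: str,
                           explanation: Optional[str],
                           other_explanations: List[str],
                           variant: str = "w/ gold expl.",
                           seed: int = 0) -> str:
    """Construct a pun-classification input from a text and an
    (optionally missing) explanation, following the two variants:

    - "w/ gold expl.": a missing explanation becomes an empty string.
    - "w/ gold + sampled neg.": a missing explanation is replaced by a
      randomly sampled explanation from another annotated example.

    The "[SEP]" separator is an assumption for illustration.
    """
    if explanation is None:
        if variant == "w/ gold expl.":
            explanation = ""
        else:
            explanation = random.Random(seed).choice(other_explanations)
    return f"{text} [SEP] {explanation}"
```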

Table 3 :
Overall stats for annotation fields in ExPUN.

Table 4 :
Agreement stats for annotated fields in the ExPUN dataset. We report averaged Cohen's κ and Spearman's ρ for numeric ratings (AF1–AF4), and averaged BLEU-4 and METEOR for text fields (AF5–AF6).

Table 5 :
Keyword annotations from different workers.
w_A shows aggregated keywords from our algorithm.

Table 6 :
Pun explanations generated by the T5 model. We use underline to indicate the pun word in the input.

Table 8 :
Examples of input pun words and keywords and the resulting generated puns.We show examples of both homographic and heterographic generated puns.
thus lacking flexibility. He et al. (2019) make use of the local-global surprisal principle to generate homophonic puns, and Yu et al. (2020) use constrained lexical rewriting for the same task. Hashimoto et al.
Explanation: The joke is a pun. The main character feels they've "died a little inside", meaning they've been changed for the worse by swallowing food coloring. At the same time, food coloring contains dye, so the main character has been "dyed" on the inside by swallowing some.
Keywords: swallowed | food coloring | dyed a little inside

• Text: Waiter, there's a fly in my soup! "I know. It gives you a nice buzz, doesn't it?"
Explanation: This is both a pun and a reference to a common joke format. "Waiter, there's a fly in my soup!" is an old joke setup with varying punchlines. Flies make a noise commonly described as a "buzz". "Buzz" can be used as a noun referring to a pleasant heightened sensation, commonly from drinking alcohol.
Keywords: fly | soup | buzz

• Text: The evil onion had many lairs.
Explanation: This is a pun. An evil lair is a hideout for a villain in a comic book or show. Onions are layered vegetables. The joke is that the onion had many lairs because it was evil.
Keywords: evil onion | many lairs

• Text: Hope for the best, but prepare for the worst.

Additional calibrating examples: The following examples were rated with an average Funniness rating >= 2 in previous pilot rounds and can be used to calibrate your rubric for assigning Funniness scores.

Table 9 :
Sample explanation sentence templates collected in ExPUN, along with their frequencies.