AmbiPun: Generating Humorous Puns with Ambiguous Context

In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories stating that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT3 to generate context words, and finally generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin.


Introduction
Computational humor has garnered interest in the NLP community (Petrović and Matthews, 2013; Miller et al., 2017; Zou and Lu, 2019; Garimella et al., 2020; Yang et al., 2020). In this paper, we tackle the problem of generating homographic puns (Miller et al., 2017), in which two or more meanings of a word form an intended humorous effect. For example, the three puns listed in Figure 1 exploit two contrasting meanings of the word sentence: 1) a grammatical string of words and 2) the punishment by a court assigned to a guilty person.
Due to the lack of sizeable training data, existing approaches are heavyweight so as not to rely on pun sentences for training. For example, Yu et al. (2018) train a constrained neural language model (Mou et al., 2015) on a general text corpus and then use a joint decoding algorithm to promote puns. He et al. (2019) propose a local-global surprisal principle, and Luo et al. (2019) leverage Generative Adversarial Nets (Goodfellow et al., 2014) to encourage ambiguity of the outputs via reinforcement learning. We, on the other hand, propose a simple yet effective way to tackle this problem: encouraging ambiguity by incorporating context words related to each sense.

Sense 1 Definition: a string of words that is complete in itself, typically containing a subject and predicate
Sense 2 Definition: (criminal law) a final judgment of guilty in a criminal case and the punishment that is imposed
Ours 1: The sentence is ungrammatical. The jury didn't hear it.
Ours 2: I'm sorry I said the sentence was too long, but punishments are endless.
Human: The Judge has got a stutter. Looks like I am not getting a sentence.
Figure 1: An illustration of homographic puns. The target pun word 'sentence' and the two sense definitions are given as input. To make the target word interpretable in both senses, we propose to include context words (highlighted in blue and pink) related to both senses.
Inspired by humor theories (Lippman and Dunn, 2000), we hypothesize that it is the contextual connections, rather than the pun word itself, that are crucial for the success of pun generation. For instance, in Figure 1 we observe that context words related to both senses (e.g., ungrammatical and jury) appear in a punning sentence. This observation is important because the error analysis of the SOTA model (Luo et al., 2019) shows that 46% of its outputs fail to be puns due to a single word sense, and another 27% fail due to being too general, both of which can be resolved by introducing more context.
Specifically, given the two sense definitions of a target pun word, we first use a reverse dictionary to generate related words that are monosemous for each sense. This first step helps us circumvent the obstacle of processing pun words with the same written form. We then propose to use context words to link the contrasting senses and make our target pun word reasonable when interpreted under both definitions. We explore three different settings: extraction-based, similarity-based, and generation-based. Finally, we finetune the T5 model (Raffel et al., 2020) on general non-humorous texts to generate coherent sentences given the pun word and context words as input.

Figure 2: Overview of the approach. We also give an example for the pun word 'sentence' at each step.
Our experimental results show that retrieving and extracting context words outperforms the giant few-shot GPT3 model in terms of generating funny pun sentences, although the latter has been shown to be much more powerful on many sophisticated tasks (Brown et al., 2020). Our simple pipeline remarkably outperforms all of the more heavyweight approaches. Our code and data are available at https://github.com/PlusLabNLP/AmbiPun.

Methodology
Overview and Motivation Our input is the target pun word (p) and its two sense definitions (S1, S2), and the output is a list of humorous punning sentences. We implement the ambiguity principle proposed by Kao et al. (2016): a pun sentence should contain one or more context words corresponding to each of the two senses. The overview of our approach is visualized in Figure 2.
Given S1 and S2, we first use a reverse dictionary to generate a list of words that semantically match the query descriptions. We call them related words (Section 2.1). However, those related words are synonyms of the pun word and are rarely observed as-is in humorous sentences. For example, in the sentence "The Judge has got a stutter. Looks like I am not getting a sentence.", the first sense (i.e., a final judgment of guilty in a criminal case) is represented by the word Judge, which could not be generated from the sense definition.
Considering such nuances, in Section 2.2 we propose three different methods to obtain the context words: extraction-based, similarity-based, and generation-based. Finally, in Sections 2.3 and 2.4, we introduce a keyword-to-text generator to produce candidate sentences, and a humor classifier to rule out some of the non-pun sentences. Final sentences are then randomly sampled for evaluation. All our training data consists of general, non-humorous corpora, except for the humor classifier.

Generating Related Words
We aim to differentiate the two senses of a polysemous word by obtaining related words, so that each sense is represented by a set of monosemous words. To this end, we leverage the Reverse Dictionary (Qi et al., 2020; Zhang et al., 2020), which takes a description as input and generates multiple related words whose semantic meanings match the query description. For each sense definition, we generate five words.
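As a rough illustration of what a reverse dictionary does, the sketch below ranks a toy vocabulary by bag-of-words similarity between each word's gloss and the query definition. This is a simplified stand-in, not the actual model of Qi et al. (2020), and the mini-lexicon is invented for illustration:

```python
from collections import Counter
import math

def bow(text):
    # Toy bag-of-words "embedding": raw token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reverse_dictionary(definition, vocab_glosses, k=5):
    # Rank vocabulary words by similarity between the query
    # definition and each word's gloss; return the top k.
    q = bow(definition)
    scored = [(cosine(q, bow(g)), w) for w, g in vocab_glosses.items()]
    return [w for s, w in sorted(scored, reverse=True)[:k] if s > 0]

glosses = {  # hypothetical mini-lexicon
    "verdict": "a final judgment in a criminal case",
    "clause": "a group of words containing a subject and predicate",
    "banana": "a long curved tropical fruit",
}
print(reverse_dictionary(
    "a final judgment of guilty in a criminal case", glosses, k=2))
```

A real reverse dictionary learns this mapping neurally, but the retrieval contract, definition in and related words out, is the same.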

Generating Context Words
For context words, we compare three different approaches. As an example, we compare the context words output for the pun word 'sentence' in Table 5 in the appendix. Refinement of the context words is described in Section A.2 in the appendix.

Method 1: Extractive (TF-IDF) For each related word, we retrieve sentences containing that word from the One Billion Word dataset and then extract keywords from the retrieved sentences using RAKE (Rose et al., 2010). Based on their TF-IDF values, we choose the top 10 context words that are most likely to be used along with the related words and therefore the pun word.

Method 2: Similarity (Word2Vec) Inspired by the idea that "a word is characterized by the company it keeps", we propose to obtain context words from word2vec. Ghazvininejad et al. (2016) have shown that the training corpus for word2vec plays a crucial role in the quality of generated context words. Hence, we train on Wikipedia, which has comprehensive coverage of diverse topics.

Method 3: Generative (Few-shot GPT3) For the generative version, we provide the powerful language model GPT3 (Brown et al., 2020) with two examples and generate context words. Details about the prompt can be found in Section E of the appendix.
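The extractive method can be sketched as follows: collect sentences that contain a related word, then score co-occurring words by TF-IDF. This is a minimal illustration using a toy three-sentence corpus and a tiny stopword list (the real pipeline uses RAKE over the One Billion Word corpus):

```python
import math
from collections import Counter

STOP = {"the", "in", "are", "a", "of", "before"}  # tiny stopword list

def tfidf_context_words(related_word, corpus, top_k=3):
    # Treat each corpus sentence as a document; score the words that
    # co-occur with `related_word` by TF-IDF and keep the top k.
    docs = [s.lower().split() for s in corpus]
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    hits = [d for d in docs if related_word in d]
    tf = Counter(w for d in hits for w in d
                 if w != related_word and w not in STOP)
    scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

corpus = [  # a stand-in for the One Billion Word corpus
    "the judge read the verdict in court",
    "the judge entered the court before trial",
    "bananas are yellow",
]
print(tfidf_context_words("judge", corpus, top_k=3))
```

Words that appear near the related word but rarely elsewhere in the corpus get the highest scores, which is exactly the kind of sense-specific context the method is after.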

Candidate Sentence generation
After obtaining context words for each sense, we generate humorous puns. We finetune a keyword-to-sentence model using T5 (Raffel et al., 2020), as it is capable of handling text-to-text tasks. The prompt provides the pun word and two context words from each of the two senses. For example, for the word 'sentence', a possible prompt is generate sentence: sentence, judge, trial, noun, comma. Moreover, we also investigate whether the position of the pun word affects the quality of generated sentences. We insert the pun word in the first, third, and fifth place of the prompt. We discuss our findings in Section 5.
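Constructing these prompts is straightforward string assembly. The sketch below follows the paper's example prompt format; how the finetuned T5 model consumes it beyond that example is an assumption:

```python
def build_prompt(pun_word, context_words, position=0):
    # Insert the pun word at a chosen index among the context
    # words, then format the T5 keyword-to-sentence prompt.
    keywords = list(context_words)
    keywords.insert(position, pun_word)
    return "generate sentence: " + ", ".join(keywords)

# Two context words per sense; pun word first (0), third (2), or fifth (4).
ctx = ["judge", "trial", "noun", "comma"]
print(build_prompt("sentence", ctx, position=0))
print(build_prompt("sentence", ctx, position=4))
```

The `position` argument is what the position study in Section 5 varies: the same keywords, with only the pun word's slot changing.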

Humor Classification
Finally, we introduce a classification model to assist us in selecting (i.e., ranking) punning sentences. Since we do not have sizable training data for puns, we propose to train our classification model on a humor dataset in a distantly supervised fashion. Specifically, we train BERT-large (Devlin et al., 2018) on the ColBERT dataset (Annamoradnejad and Zoghi, 2020), which contains 200,000 jokes and non-jokes used for humor detection. We use the probability produced by the classification model to rank our candidate sentences.
Our error analysis in Appendix B shows that our classifier can successfully rule out bad generations, i.e., non-puns, as puns are humorous by nature. However, the model is not good at choosing the best samples. Therefore, we use this classifier only to remove the bottom third of the candidates. We leave it as open future work to accurately pick out high-quality punning sentences rather than merely funny sentences.
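The filtering step reduces to ranking candidates by classifier probability and dropping the lowest-scoring third; a minimal sketch (the probabilities here are placeholders for classifier outputs):

```python
def filter_candidates(candidates, scores):
    # Rank candidates by humor-classifier probability and drop
    # the bottom third; the rest are kept for random sampling.
    ranked = sorted(zip(scores, candidates), reverse=True)
    keep = len(candidates) - len(candidates) // 3
    return [c for _, c in ranked[:keep]]

cands = ["pun a", "pun b", "pun c", "pun d", "pun e", "pun f"]
probs = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
print(filter_candidates(cands, probs))
```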

Datasets
Training: For the context word generation step, we use the One Billion Word dataset (Chelba et al., 2013) to retrieve sentences for a given word and calculate TF-IDF values. This dataset contains roughly 0.8B words and is obtained from WMT 2011 News Crawl data. For the humor classifier and the candidate generation module, we use the ColBERT dataset (Annamoradnejad and Zoghi, 2020). It contains 100k jokes and 100k non-jokes.

Evaluation dataset: Following other pun generation works, we use the SemEval 2017 Task 7 dataset (Miller et al., 2017) for evaluation. The dataset contains 1,163 human-written punning jokes with a total of 895 unique pun words. Each sentence has the target pun word, the location of the pun word, and the WordNet sense keys of the two senses.

Baselines
Neural Pun Yu et al. (2018) propose the first neural approach to homographic pun generation, based on a constrained beam search algorithm that jointly decodes the two distinct senses of the same word.
Pun-GAN The SOTA model introduced by Luo et al. (2019), which adopts a Generative Adversarial Net to generate homographic puns.

Few-shot GPT3 We generate puns by feeding a few examples into GPT3 davinci-instruct-beta, the most capable model in the GPT3 family for creative generation. We provide the target pun word and its two senses in the prompt, along with an instruction to generate puns.

Ablations of our own models We also compare the three methods we propose to obtain context words (described in Section 2.2). We call them Ext AMBIPUN, Sim AMBIPUN, and Gen AMBIPUN.

Table 3: Generated sentences for the words 'Irrational' and 'Drive' and their sense definitions. We underline the context words related to each sense. All generations are evaluated by external annotators, not the authors.

Evaluation
Automatic Evaluation We follow Luo et al. (2019); Yu et al. (2018) to calculate distinct unigrams and bigrams as diversity metrics (Li et al., 2016) at both the sentence level and the corpus level. We also report the average length of the produced sentences.
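The distinct-n metric is the ratio of unique n-grams to total n-grams over a set of outputs; a minimal sketch with toy outputs:

```python
def distinct_n(sentences, n):
    # Ratio of unique n-grams to total n-grams across all sentences.
    ngrams = []
    for s in sentences:
        toks = s.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

outs = ["the sentence was long", "the sentence was ungrammatical"]
print(distinct_n(outs, 1))  # distinct unigrams
print(distinct_n(outs, 2))  # distinct bigrams
```

Sentence-level diversity applies the same ratio within each sentence; corpus-level diversity, as here, pools n-grams over all outputs.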
Human Evaluation We randomly select 100 sentences and collect human ratings on Amazon Mechanical Turk (AMT). For each sentence, three workers are explicitly given the target pun word and its two sense definitions provided by the SemEval 2017 Task 7. We first ask them to judge whether a given sentence is a successful pun. Then, they are asked to rate its funniness and coherence on a scale from 1 (not at all) to 5 (extremely). Besides detailed instructions and an explanation for each criterion, we also adopt attention questions and qualification types to rule out irresponsible workers. We conduct paired t-tests for significance testing. The difference between our best-performing model and the best baseline is significant.

Pun Generation Results
Automatic Evaluation Results of the automatic evaluation can be seen in Table 1. The average sentence length of our model is closest to humans, and our model achieves the highest sentence-level diversity. Although the baselines Pun-GAN and Few-shot GPT3 have higher distinct unigram ratios at the corpus level, this is because all baseline models generate one sentence per pun word, while AMBIPUN generates tens of sentences per pun word, which inevitably sacrifices topical diversity. Nevertheless, AMBIPUN achieves the highest corpus-level bigram diversity.
Human Evaluation Results of the human evaluation can be seen in Table 2. We evaluate the success rate, funniness, and coherence of the generated outputs. The superiority of our models is clear. For significance testing, we conducted paired t-tests, where our systems outperformed the best baseline in terms of success rate and funniness (p-value < 0.05). On the other hand, GPT3 can generate even more coherently than humans.
Analysis of the extractive versus generative method Interestingly, the extractive method has a higher success rate (p-value < 0.05) and is funnier (p-value < 0.07) than the generative method, indicating that extracting salient words from human-written sentences can introduce more surprising and uncommon words than language models do. We posit that those atypical words surprise readers and thus boost the pun success rate as well as the funniness score. On the other hand, we also tried to equip GPT3 with greater creativity via top-k sampling with a large temperature T. However, larger values of T also result in arbitrary responses that humans may find unreadable. We hope our discovery draws the community's attention to these traditional techniques for creative generation.
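The temperature knob discussed above reshapes the next-token distribution before sampling: dividing the logits by T sharpens the distribution for small T and flattens it for large T. A minimal top-k sampler over toy logits (not GPT3's internals):

```python
import math
import random

def sample_with_temperature(logits, temperature, k=2, seed=0):
    # Scale logits by 1/T, keep the top-k tokens, renormalize,
    # and sample a token index from the resulting distribution.
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: -scaled[i])[:k]
    exps = [math.exp(scaled[i]) for i in top]
    z = sum(exps)
    random.seed(seed)  # fixed seed for a reproducible illustration
    r, acc = random.random(), 0.0
    for i, e in zip(top, exps):
        acc += e / z
        if r <= acc:
            return i
    return top[-1]

logits = [2.0, 1.0, 0.1]
# Low T concentrates mass on the argmax; high T lets
# lower-scoring (more "arbitrary") tokens win.
print(sample_with_temperature(logits, 0.1))
print(sample_with_temperature(logits, 5.0))
```

This makes the trade-off concrete: raising T buys surprise at the cost of the very arbitrariness the analysis above observes.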

Case Study
To better understand the advantages of our method, we conduct a case study for the pun words "Irrational" and "Drive" in Table 3.

The Position of Pun Words
As mentioned in Section 2.3, we vary the position of the pun word in the prompt given to the candidate sentence generation model. We try three variants by putting the target pun word at the start, in the middle, and at the end. For each variant, we ask Mechanical Turkers to judge whether the given sentences are puns. Again, each sentence is rated by three Turkers, and we take the majority answer when the workers disagree. Results from this analysis can be seen in Table 4. We observe that people find a sentence more likely to be a pun when the target word appears at the end.
To verify this hypothesis, we also calculate the positions of the pun words in the 1,163 human-written pun sentences from SemEval 2017 Task 7 and report the distribution in Figure 3 in the Appendix. The histogram corroborates the human-annotated samples: both suggest that keeping the pun word at the end of the sentence generates funnier puns. The theory of humor stating that the "joke" in a funny sentence comes towards the end (Shahaf et al., 2015) further validates our analysis.
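The position statistic behind the histogram is simply the pun word's token index normalized by sentence length; a minimal sketch (assuming the pun word appears verbatim as a single token):

```python
def relative_positions(puns, pun_words):
    # Relative position of the pun word in each sentence, in [0, 1]:
    # 0 means first token, 1 means last token.
    pos = []
    for sent, w in zip(puns, pun_words):
        toks = sent.lower().replace(".", "").split()
        pos.append(toks.index(w) / (len(toks) - 1))
    return pos

puns = ["Looks like I am not getting a sentence"]
print(relative_positions(puns, ["sentence"]))
```

Binning these values over the 1,163 SemEval sentences yields the distribution in Figure 3.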
Related Works

Creative Text Generation

Pun generation. Many previous works on pun generation have focused on phonological or syntactic patterns rather than semantic patterns (Miller and Gurevych, 2015; Hong and Ong, 2009; Petrović and Matthews, 2013; Valitutti et al., 2013), thus lacking creativity and flexibility. He et al. (2019) make use of the local-global surprisal principle to generate homophonic puns, and Yu et al. (2020) use constrained lexical rewriting for the same task. Hashimoto et al. (2018) use a retrieve-and-edit approach to generate homographic puns, and Yu et al. (2018); Luo et al. (2019) propose complex neural model architectures such as a constrained language model and a GAN, and do not put emphasis on the linguistic structure of puns. We identify their absence of both senses as a shortcoming and build our approach from there.
Humor generation. Humor generation remains an unsolved problem and is usually studied in a specific setting. Petrović and Matthews (2013) generate jokes of the type 'I like my X like I like my Y, Z'. Garimella et al. (2020) develop a model to fill in blanks in the madlibs format, and Yang et al. (2020) edit headlines to make them funny. More research is required to generate humorous sentences that are not constrained by their semantic structure.
Figurative language generation. In addition to puns, there are many attempts to generate figurative language such as metaphor, simile (Chakrabarty et al., 2020b), and sarcasm. Yu and Wan (2019) use metaphorically used verbs to generate metaphors in an unsupervised fashion, while Chakrabarty et al. (2021); Stowe et al. (2021) generate metaphors using symbolism and discriminative and conceptual mappings. Mishra et al. (2019) propose a modular architecture for unsupervised sarcasm generation, and Chakrabarty et al. (2020a) use commonsense knowledge for the same task. Tian et al. (2021), on the other hand, are the first to leverage semantic structure together with commonsense and counterfactual knowledge to generate hyperboles.

Pun detection
SemEval 2017 Task 7 (Miller et al., 2017) introduced the challenges of pun detection, location detection, and sense interpretation for homographic and homophonic puns. Diao et al. (2019) make use of a Gated Attention network to detect homophonic puns. Zou and Lu (2019) introduce a tagging scheme that lets them detect puns as well as their locations, and they apply this approach to both homophonic and homographic puns.

Conclusion
We propose a novel approach to homographic pun generation. Unlike previous works that are mathematically heavy, our approach is backed by the humor theory that ambiguity is achieved through context. Automatic and human evaluations show that our model AMBIPUN outperforms the current state-of-the-art model by a large margin.

Table 1: Results of automatic evaluation on average sequence length and sentence-level and corpus-level diversity. Boldface denotes the best performance and underline denotes the second-best performance among systems.

Table 2: Human evaluation results on all the pun generation systems. We show the success rates and the average scores of funniness and coherence. Overall, Ext AMBIPUN performs the best. The superiority of our model in terms of success rate and funniness is statistically significant over the best baseline and is marked by *.


Table 4: The pun success rate of sentences based on pun-word position, as annotated by humans.