Context-Situated Pun Generation

Previous work on pun generation commonly begins with a given pun word (a pair of homophones for heterographic pun generation and a polyseme for homographic pun generation) and seeks to generate an appropriate pun. While this may enable efficient pun generation, we believe that a pun is most entertaining if it fits appropriately within a given context, e.g., a given situation or dialogue. In this work, we propose a new task, context-situated pun generation, where a specific context represented by a set of keywords is provided, and the task is to first identify suitable pun words that are appropriate for the context, then generate puns based on the context keywords and the identified pun words. We collect a new dataset, CUP (Context-sitUated Pun), containing 4.5k tuples of context words and pun pairs. Based on the new data and setup, we propose a pipeline system for context-situated pun generation, including a pun word retrieval module that identifies suitable pun words for a given context, and a pun generation module that generates puns from context keywords and pun words. Human evaluation shows that 69% of our top retrieved pun words can be used to generate context-situated puns, and our generation module yields successful puns 31% of the time given a plausible tuple of context words and pun pair, almost tripling the yield of a state-of-the-art pun generation model. With an end-to-end evaluation, our pipeline system with the top-1 retrieved pun pair for a given context can generate successful puns 40% of the time, better than all other modeling variations but 32% lower than the human success rate. This highlights the difficulty of the task, and encourages more research in this direction.


Introduction
Pun generation is a challenging creative generation task that has attracted recent attention in the research community (He et al., 2019; Yu et al., 2018, 2020; Mittal et al., 2022; Horri, 2011). As one of the most important ways to communicate humor (Abbas and Dhiaa, 2016), puns can help relieve anxiety, avoid painful feelings, and facilitate learning (Buxman, 2008). At the same time, spontaneity is the twin concept of creativity (Moreno, 1955), which means that context matters greatly for making an appropriate and funny pun.

Figure 1: Context-situated pun generation aims to find relevant pun words to generate puns within a given context. We propose a unified framework to generate both homographic and heterographic puns; examples shown here are human-written puns from our corpus.
Existing work on pun generation mainly focuses on generating puns given a pair of pun-alternative words or senses (we call it a pun pair). Specifically, in heterographic pun generation, systems generate puns using a pair of homophones involving a pun word and an alternative word (He et al., 2019; Yu et al., 2020; Mittal et al., 2022). Alternatively, in homographic pun generation, systems generate puns that must support both given senses of a single polysemous word (Yu et al., 2018; Luo et al., 2019; Tian et al., 2022). Despite the great progress that has been made under such experimental settings, real-world applications for pun generation (e.g., in dialogue systems or creative slogan generation) rarely have these pun pairs provided. Instead, puns need to be generated given a more naturally-occurring conversational or creative context, requiring the identification of a pun pair that is relevant and appropriate for that context. For example, given a conversation turn "How was the magic show?", a context-situated pun response might be, "The magician got so mad he pulled his hare out." Motivated by real-world applications and the theory that the funniness of a pun heavily relies on the context, we formally define and introduce a new setting for pun generation, which we call context-situated pun generation: given a context represented by a set of keywords, the task is to generate puns that fit the given context (Figure 1). Our contributions are as follows: • We introduce a new setting of context-situated pun generation.
• To facilitate research in this direction, we collect a large-scale corpus called CUP (Context-sitUated Pun), which contains 4,551 tuples of context keywords and an associated pun pair, each labelled with whether they are compatible for composing a pun. If a tuple is compatible, we additionally collect a human-written pun that incorporates both the context keywords and the pun word. 1
• We build a pipeline system with a retrieval module to predict proper pun words given the current context, and a generation module to incorporate both the context keywords and the pun word to generate puns. Our system serves as a strong baseline for context-situated pun generation.

1 Resources will be available at: https://github.com/amazon-research/context-situated-pun-generation

Task Formulation
Preliminaries. Ambiguity is the key to pun generation (Ritchie, 2005). First, we define the term pun pair as used in our work. For heterographic pun generation, there exists a pair of homophones, which we call the pun word (p_w) and the alternative word (a_w). While only p_w appears in the pun, the meanings of both p_w and a_w are supported in the pun sentence. Therefore, the input of heterographic pun generation can be written as (p_w, S_pw, a_w, S_aw), where S_pw and S_aw are the senses of the pun word and alternative word, respectively. We refer to these as pun pairs, and use the shorthand (p_w, a_w) for simplicity. For homographic pun generation, the pun word is a polyseme that has two meanings; here, we can use the same representation, where p_w = a_w for homographic puns.
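The unified representation above can be captured in a few lines of code. The following sketch is illustrative only; the class and field names are our own, not from the paper's implementation:

```python
# Hypothetical sketch of the unified pun-pair representation (p_w, S_pw, a_w, S_aw).
from dataclasses import dataclass

@dataclass(frozen=True)
class PunPair:
    pun_word: str   # p_w, the word that surfaces in the pun
    pun_sense: str  # S_pw, gloss of the pun word's intended sense
    alt_word: str   # a_w, homophone (heterographic) or same word (homographic)
    alt_sense: str  # S_aw, gloss of the alternative sense

    @property
    def is_homographic(self) -> bool:
        # Homographic puns reuse a single polyseme, so p_w == a_w.
        return self.pun_word == self.alt_word

hetero = PunPair("hare", "a swift rabbit", "hair", "a covering for the head")
homo = PunPair("makeup", "cosmetics", "makeup", "a repeat of a missed session")
```

Both pun types then flow through the same downstream modules, which is what makes the unified framework possible.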
Formulation. Given the unified representation for heterographic and homographic puns, we define the task of context-situated pun generation as follows: given a context C, which can be a sentence or a list of keywords, find a pun pair (p_w, S_pw, a_w, S_aw) that is suitable for generating a pun, then generate a pun using the chosen pun pair situated in the given context. In this work, we assume we are given a fixed set of pun pair candidates (P_w, A_w) from which (p_w, a_w) are retrieved. The unified format between heterographic and homographic puns makes it possible for us to propose a unified framework for pun generation.

CUP Dataset
Motivation. The largest and most commonly used dataset in the pun generation community is the SemEval 2017 Task 7 dataset (Miller et al., 2017). Under our setting of context-situated pun generation, we can utilize keywords from the puns themselves as context. However, the majority of pun pairs occur only once in the SemEval dataset, while one given context could have been compatible with many other pun pairs. For example, given the context beauty school, class, the original pun in the SemEval dataset uses the homographic pun pair (makeup, makeup) and says: "If you miss a class at beauty school you'll need a makeup session." At the same time, a creative human can use the heterographic pun pair (dyed, die) to instead generate "I inhaled so much ash from the eye shadow palette at the beauty school class - I might have dyed a little inside." Because of this limitation of the SemEval dataset, we need a dataset that pairs given contexts with a diverse set of pun pairs. Furthermore, the dataset should be annotated to indicate whether each combination of context words and pun pair is suitable for making a context-situated pun.
Data Preparation. We sample puns that contain both sense annotations and pun word annotations from SemEval Task 7. We show two examples each of heterographic and homographic puns and their annotations from the SemEval dataset in Table 1. From this set, we sample from the 500 most frequent (p_w, a_w) pairs and randomly sample 100 [...].

Annotation. For our annotation task, we asked annotators to indicate whether they could come up with a pun, using pun pair (p_w, a_w), that is situated in a given context C and supports both senses S_pw and S_aw. If an annotator indicated that they could create such a pun, we then asked the annotator to write down the pun they came up with. Meanwhile, we asked annotators how difficult it was for them to come up with the pun on a scale of 1 to 5, where 1 means very easy and 5 means very hard. To aid in writing puns, we also provided four T5-generated puns as references. We deployed our annotation task on Amazon Mechanical Turk using a pool of 250 annotators with whom we have collaborated in the past and who have previously been identified as good annotators. Each HIT contained three (C, p_w, a_w) tuples and we paid one US dollar per HIT. To ensure dataset quality, we manually checked the annotations and accepted HITs from annotators who tended not to skip all the annotations (i.e., did not mark everything as "cannot come up with a pun"). After iterative communication and manual examination, we narrowed the pool down to three annotators that we marked as highly creative to work on the annotation. To check inter-annotator agreement, we collected multiple annotations for 150 instances and measured agreement using Fleiss' kappa (Fleiss and Cohen, 1973), obtaining κ = 0.43, which suggests moderate agreement.
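For reference, chance-corrected agreement of this kind can be computed in a few lines of Python. This is a generic implementation of Fleiss' kappa, not the authors' script:

```python
# Generic Fleiss' kappa over a matrix of per-item category counts.
def fleiss_kappa(ratings):
    """ratings: list of per-item category counts, e.g. [[3, 0], [2, 1], ...],
    where each row sums to the (constant) number of raters n."""
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # number of categories
    # Mean observed per-item agreement P_bar.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # Expected agreement P_e from marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p = [t / (N * n) for t in totals]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1, and values near 0.4-0.6 are conventionally read as moderate agreement, matching the κ = 0.43 reported above.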
Statistics. After annotation, we ended up with 2,753 (C, p_w, a_w) tuples annotated as compatible and 1,798 as incompatible. For the 2,753 compatible tuples, we additionally collected human-written puns from annotators. The number of puns we collected exceeds the number of puns in SemEval 2017 Task 7 that have both pun word and alternative word sense annotations (2,396 puns). The binary compatibility labels and human-written puns comprise our resulting dataset, CUP (Context-sitUated Pun). Table 2 shows examples of annotations in CUP.

Context-Situated Pun Generation
We propose a pipeline framework to generate context-situated puns, shown in Figure 2. It consists of: (i) a retrieval-based module that selects a set of relevant pun word pairs, and (ii) a generation module that takes the context words and retrieved pun word pairs as input to generate puns. In this section, we briefly describe each component.
Pun Word Pair Retrieval. We propose a retrieve-and-rank strategy to select k relevant pun word pairs (p_w, a_w) from a large, fixed set of pun word pairs (P_w, A_w) for a given context C. C should be a list of keywords describing the context. If the context is given as a sentence, we use RAKE (Rose et al., 2010) to automatically extract a list of keywords from the context to construct C. For each context C, we apply a classifier to all available pun word pairs in our data (P_w, A_w) and retrieve pairs classified as suitable. Then, we rank the suitable instances according to the model's confidence and take the top k instances as the final retrieved (p_w, a_w) pun word pairs. We experiment with both supervised and unsupervised approaches to build the retrieval module in Section 5.1.
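The retrieve-and-rank step can be sketched as follows, with `score_fn` standing in for any of the classifiers described in Section 5.1; the function and parameter names here are illustrative, not from the paper's code:

```python
# Sketch of retrieve-and-rank: score every candidate pun pair against the
# context, keep those classified as suitable, rank by confidence, return top k.
def retrieve_pun_pairs(context_keywords, candidates, score_fn, k=5, threshold=0.5):
    """candidates: iterable of (p_w, a_w) tuples.
    score_fn(context_keywords, pair) -> suitability confidence in [0, 1]."""
    scored = [(pair, score_fn(context_keywords, pair)) for pair in candidates]
    # Keep only pairs the classifier deems suitable.
    suitable = [(pair, s) for pair, s in scored if s >= threshold]
    # Rank by model confidence, highest first.
    suitable.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in suitable[:k]]
```

In the full system, `score_fn` would wrap a finetuned classifier's softmax output; any scorer with the same interface (including the unsupervised distance below, negated) can be dropped in.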
Pun Generation. Given pun word pair (p_w, a_w), pun word senses S_pw and S_aw, and context C, the pun generation module generates puns that relate to C, incorporate pun word p_w, and embody the meanings S_pw and S_aw of the pun word pair. Since there are limited pun datasets available for model training, we adopt a two-stage strategy: we pretrain a T5-base (Raffel et al., 2020) model on non-pun text to learn to incorporate words and their senses in generations, then finetune the model on pun data to learn the structure of puns. We describe our pun generation models in Section 5.2.

Experiments
We design our experiments to answer the following three research questions: Q1. What is the performance of the pun word pair retrieval module? (Section 5.1) Q2. What is the performance of the pun generation module? Is the pretraining stage necessary? (Section 5.2) Q3. How well does the pipeline system perform in an end-to-end evaluation? Is the context-situated pun generation task plausible for humans? (Section 5.3)

Pun Word Pair Retrieval
In this task, for a given context C of keywords, the goal is to select k relevant pun word pairs (p_w, a_w) from a large, fixed set of pun word pairs (P_w, A_w).
Approaches. We experiment with two approaches to building pun word pair retrieval systems: supervised neural modeling and unsupervised embedding-based methods.
Neural. We finetune BERT-base (Devlin et al., 2019), RoBERTa-base (Liu et al., 2019) and DeBERTa-base (He et al., 2021) models on the CUP dataset for pun word pair classification. The input is formatted as sentence matching, where, given the context C as sentence 1 and the pun word pair as sentence 2, the output label indicates whether the two sentences are compatible. Additionally, we experiment with finetuning natural language inference (NLI) models, RoBERTa-large-NLI (Liu et al., 2019) and BART-large-MNLI (Lewis et al., 2020).
We use the context words as premise and the pun word pair as hypothesis, with entailment and contradiction labels as outputs.For each context, we retrieve all pun pairs classified as suitable by the model, then rank the instances according to the model's confidence (i.e., output from the last layer after softmax) to retrieve the top-k pun pairs.
Unsupervised. The key idea behind the compatibility classification is to find pun word pairs that are semantically close to the context. Therefore, a natural question to ask is, "Can an unsupervised method that measures semantic similarity perform as well as the neural method?" Here, we use the Euclidean distance between GloVe embeddings (Pennington et al., 2014) of pun words and context words to measure semantic similarity. Formally, for a context C consisting of a list of context words c_1, c_2, ..., c_n, we calculate the average Euclidean distance between the GloVe representations of p_w and a_w and the embedding of each context word c_i:

d(C, p_w, a_w) = (1 / 2n) * sum_{i=1..n} ( ||g(p_w) - g(c_i)||_2 + ||g(a_w) - g(c_i)||_2 ),

where g(.) denotes a word's GloVe embedding. Then we rank all 500 possible (p_w, a_w) candidates using this distance score, retrieving the k pairs with the smallest distance as the top-k retrieved pun word pairs.
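The unsupervised ranking described above can be sketched as follows, with a toy `emb` dictionary standing in for pretrained GloVe vectors (a minimal sketch, assuming embeddings are plain tuples of floats):

```python
# Unsupervised retrieval: rank pun pairs by average Euclidean distance
# between the pair's embeddings and each context word's embedding.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_distance(context_words, pair, emb):
    """Average of the p_w and a_w distances to each context word."""
    p_w, a_w = pair
    dists = [
        (euclidean(emb[p_w], emb[c]) + euclidean(emb[a_w], emb[c])) / 2
        for c in context_words
    ]
    return sum(dists) / len(dists)

def rank_pairs(context_words, candidates, emb, k=1):
    # Smaller distance = semantically closer to the context = better.
    return sorted(candidates, key=lambda p: pair_distance(context_words, p, emb))[:k]
```

With real GloVe vectors, `emb` would be loaded from the pretrained embedding file; out-of-vocabulary words would need a fallback, which this sketch omits.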
Experiment Setup. We split CUP into 70% training, 10% validation and 20% test data. Table 3 shows the distribution of pun word compatibility labels in our data splits. For each context, we use our models to retrieve pun word pairs from 500 candidate pairs for making context-situated puns.

Table 3: CUP data splits for the pun word pair retrieval task. We show the distribution of (C, p_w, a_w) tuples labeled as suitable or unsuitable in each split.
Evaluation Metrics. For neural models, we first benchmark the accuracy, precision, recall, and F1 of the model's predictions for the pun word pair classification task on the CUP dataset. Additionally, for both approaches, we use the True Positive rate (TP@N) to evaluate the performance of our pun word retrieval module. It measures the percentage of top-N retrieved pun word pairs that can be used to generate puns for a given context. The higher the TP@N, the stronger the retrieval module is at retrieving appropriate pun word pairs.
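The TP@N metric can be computed straightforwardly; the data structures below are illustrative, not the paper's evaluation code:

```python
# TP@N: fraction of top-N retrieved pairs (pooled over contexts) that are
# labeled compatible with their context in the gold annotations.
def tp_at_n(retrievals, gold_compatible, n):
    """retrievals: {context: ranked list of pun pairs}.
    gold_compatible: set of (context, pair) tuples labeled compatible."""
    hits = total = 0
    for context, ranked in retrievals.items():
        for pair in ranked[:n]:
            total += 1
            hits += (context, pair) in gold_compatible
    return hits / total
```

A TP@1 of 0.69, as reported below, would mean 69% of top-1 retrievals are gold-compatible.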
Results. We show results of our supervised pun word classifiers in Table 4. Our results show that the task of classifying whether a context is compatible with a pun word pair is challenging for current pretrained LMs, with a best F1 of 64.72 from RoBERTa-large-NLI. Table 5 shows the TP@N evaluation of pun word pairs retrieved by our best neural model, finetuned RoBERTa-large-NLI, and our unsupervised method. In general, the supervised neural model outperforms the unsupervised method. TP@1 shows that 69% of pun word pairs retrieved by the neural model are compatible with their given context, showcasing the effectiveness of our retrieval module. We provide additional qualitative analysis in Appendix C, Table 9.

Pun Generation
Given pun word pair (p w , a w ), pun word senses S pw and S aw , and context C, the task is to generate a pun that relates to C, incorporates pun word p w , and utilizes both pun word senses S pw and S aw .
Approach. For the novel task of context-situated pun generation, we establish a baseline model that uses a combination of pretraining on non-pun text and finetuning on pun text to generate both homographic and heterographic puns. Our unified framework for homographic and heterographic pun generation is also new to the community. We evaluate the following model variants, comparing against the state-of-the-art baseline AmbiPun (Mittal et al., 2022).

Table 4: Pun word classification performance of neural models on CUP (DeBERTa-base (He et al., 2021), RoBERTa-large-NLI (Liu et al., 2019), and BART-large-NLI (Lewis et al., 2020)), showing that our task is challenging for pretrained LMs. We report models' performance across three random seeds with standard deviation as subscripts.
Finetuned T5 (T5_FT). We finetune T5-base (Raffel et al., 2020) on the SemEval 2017 Task 7 dataset (Miller et al., 2017), in which puns are annotated with pun word pairs p_w and a_w along with their sense information S_pw and S_aw. We construct C using the RAKE (Rose et al., 2010) keyword extraction algorithm on the pun text, and further verify the keywords against human-annotated keywords from an augmentation of the SemEval dataset we designed to enable keyword-conditioned pun generation (Sun et al., 2022). During finetuning, we use the input prompt: "generate a pun that situates in {C}, using the word {p_w}, {p_w} means {S_pw} and {a_w} means {S_aw}". The goal of finetuning is to teach the model to incorporate both word senses in the final generated puns.
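The input prompt can be assembled with a simple template function. This is a sketch; only the string format itself comes from the paper, and the function name is our own:

```python
# Build the finetuning prompt described above from a context and pun pair.
def make_pun_prompt(context_keywords, p_w, s_pw, a_w, s_aw):
    c = ", ".join(context_keywords)
    return (
        f"generate a pun that situates in {c}, using the word {p_w}, "
        f"{p_w} means {s_pw} and {a_w} means {s_aw}"
    )
```

The target sequence during finetuning is simply the gold pun text, so each training example is a (prompt, pun) string pair.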
Finetuned T5 with pretraining (T5_PT+FT). Here, we investigate whether the model can learn to incorporate words and their senses into generated sentences by pretraining on non-pun text. To this end, we automatically construct a pretraining dataset from BookCorpus (Zhu et al., 2015).
For each word w ∈ {p_w, a_w} in a given pun word pair, we mine 200 sentences that contain w from BookCorpus. We extract keywords from a given BookCorpus sentence containing w using RAKE to construct context C. We retain noun and verb keywords, as they are more likely to have significant impact at the sentence level (Kim and Thompson, 2000; Cutler and Foss, 1977), and exclude pun word w from the keyword list. Using these automatically-constructed samples, we finetune T5 (Raffel et al., 2020) to generate sentences situated in C that incorporate w, using the input prompt: "generate a sentence that situates in {C}, using the word {w}, {w} means {S_w} and {w} means {S_w}", the output of which is the retrieved sentence from BookCorpus that uses C and w.
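One pretraining example might be assembled as follows. Keyword extraction and POS tagging (RAKE plus a tagger) are replaced here by precomputed inputs, so this is a sketch of the data construction step only, not a full pipeline:

```python
# Build one (prompt, target) pretraining pair from a mined BookCorpus sentence.
def make_pretrain_example(sentence, keywords_with_pos, w, s_w):
    """keywords_with_pos: list of (keyword, coarse POS tag) pairs, assumed
    to come from RAKE plus a POS tagger upstream."""
    # Keep noun/verb keywords and drop the target word itself.
    context = [k for k, pos in keywords_with_pos
               if pos in {"NOUN", "VERB"} and k != w]
    c = ", ".join(context)
    prompt = (
        f"generate a sentence that situates in {c}, using the word {w}, "
        f"{w} means {s_w} and {w} means {s_w}"
    )
    # The target is the original sentence, which already contains C and w.
    return prompt, sentence
```

Repeating the sense gloss twice keeps the pretraining prompt format identical to the pun prompt, so the finetuning stage only has to change the sense content, not the template.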
Experiment Setup.We finetune our T5 models on 1,382 training samples from SemEval Task 7 that contain both pun word and sense annotations.
For testing, we randomly sample 200 (C, p_w, a_w) tuples from CUP that annotators marked as compatible. We use each model to generate puns for this set and compare their performance.

Evaluation Metrics. We report the pun word incorporation rate as an automatic evaluation metric to measure the model's ability to incorporate pun words in the final generation. We also conduct human evaluation on Amazon Mechanical Turk to judge whether the generated puns are successful.

Table 6: Pun generation results using automatic (pun word incorporation) and human (success rate) evaluation. We compare our finetuned T5 models to a state-of-the-art baseline, AmbiPun (Mittal et al., 2022). PT stands for Pre-Training and FT stands for Fine-Tuning.
Results. Pun generation results are shown in Table 6. We find that: (1) adding the pretraining stage helps our model better incorporate pun words, and (2) our generation module can generate successful puns at almost triple the rate of the current state-of-the-art framework AmbiPun (examples in Table 7). We hypothesize that this is because AmbiPun is a completely unsupervised approach in which the pun generator is not finetuned on any pun data, and because our models additionally benefit from rich word sense information in the input.

End-to-end Evaluation
Finally, we evaluate how well our pipeline retrieves relevant pun word pairs and generates novel puns given a context of keywords in an end-to-end fashion, and compare our pipeline's performance to the human-written puns in CUP.
Experiment Setup.We randomly choose 60 context words to conduct the end-to-end evaluation.
For each context, we use both the unsupervised and neural pun word retrieval modules from Section 5.1 to retrieve the top-1 predicted pun word pair, then use each of the pun generation modules from Section 5.2 to generate puns using the retrieved pun word pair. We also compare with human performance. For each context, we find the human-written pun in CUP that annotators indicated was least difficult to write, randomly sampling one pun in case of ties. We use annotation difficulty as a proxy for ranking human context-situated puns, assuming more natural puns are easier to write.
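Selecting the human reference for each context, as described above, can be sketched as follows (the data layout and function name are illustrative):

```python
# Pick the least-difficult human pun for a context, breaking ties randomly.
import random

def pick_human_pun(puns, seed=0):
    """puns: list of (pun_text, difficulty) with difficulty on the 1-5 scale.
    A fixed seed keeps the tie-break reproducible in this sketch."""
    best = min(d for _, d in puns)
    tied = [text for text, d in puns if d == best]
    return random.Random(seed).choice(tied)
```

Using the minimum difficulty rather than, say, the mean matches the selection rule stated above: one pun per context, the easiest to write.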
Evaluation Metrics.We measure the incorporation rate of context words C and pun words p w as automatic evaluation metrics.In addition, similar to standalone pun generation evaluation, we conduct human evaluation to judge whether the generated puns are successful.
Results. We report results for combinations of our retrieval and generation modules in Table 8. We show that: (i) our pretraining step is helpful in terms of both improving the keyword incorporation rate and the pun success rate of the generation module, despite using retrieved pun words as input. (ii) Our pipeline system performs the best among all model variations, yielding a pun generation success rate of 40%. This success rate improves over the best reported in Section 5.2 (31%), showcasing the benefit of using our neural pun word retrieval module over randomly sampling pun word pairs for a given context. However, (iii) the best model performance is still about 32% lower than the human success rate, indicating that humans can complete the context-situated pun generation task far more successfully and that there remains large room for improvement.

Related Work
Our work proposes an approach for conditional generation of humor using a retrieve-and-generate framework.More specifically, our work enables a constrained type of pun generation.We briefly summarize existing work in these directions.
Humor generation. With the recent advent of diverse datasets (Hasan et al., 2019; Mittal et al., 2021; Yang et al., 2021), it has become easier to detect and generate humor. While large pretrained models have been fairly successful at humor detection, humor generation still remains an unsolved problem, and is usually studied in specific settings. Petrović and Matthews (2013) [...] syntactic parses and then generate paraphrases that keep the semantic meaning while conforming to the retrieved syntactic parses.

Table 8: End-to-end evaluation of our system against AmbiPun and human baselines.
Pun generation. Previous work on pun generation has focused on heterographic pun generation or homographic pun generation (Miller and Gurevych, 2015; Hong and Ong, 2009; Petrović and Matthews, 2013; Valitutti et al., 2013). However, all of these approaches assume the pun words are given as input. Heterographic pun generation requires a pair of homophones, and homographic pun generation requires a polyseme, i.e., a pun word that has more than one meaning. For heterographic pun generation, the two meanings come from the pair of homophones, while for homographic pun generation, the two meanings come from the polyseme itself. Motivated by this, we propose a unified framework that can generate both heterographic and homographic puns adaptively.

Conclusion and Future Work
We propose a new setting for pun generation: context-situated pun generation. To facilitate future research in this direction, we first collect a large-scale corpus, CUP, which contains 4,551 context and pun pair tuples annotated for compatibility, along with 2,753 human-written puns for the compatible tuples; this is even larger than the most commonly used pun dataset, SemEval 2017 Task 7 (Miller et al., 2017). To benchmark the performance of state-of-the-art NLG techniques on the proposed task, we build a pipeline system composed of a pun pair retrieval module that identifies suitable pun pairs for a given context and a generation module that generates context-situated puns given the context and a compatible pun pair. Human evaluation shows that the best model achieves a 40% success rate in end-to-end evaluation, trailing human performance by almost 32%, highlighting the challenge of the task and encouraging more future work in this direction.
Our work introduces the concept of situating pun generation in context. However, future work can easily extend the idea and framework to other areas of creative generation, such as metaphor generation, lyric generation, and others. Another promising future direction is to integrate the generated puns into the original conversational or situational context to improve the interest and engagement of downstream applications. We hope our work can inspire more innovation on context-situated creative generation.
Figure 2: Our framework contains two components: (i) a retrieval component (top) that identifies relevant pun words for a given context, and (ii) a generation component (bottom) that takes the context and retrieved pun words and generates context-situated puns.

Table 1 :
Two examples each of heterographic puns and homographic puns in the SemEval 2017 Task 7 dataset. We construct context C by extracting keywords from the pun and excluding the pun word p_w. Word sense information S_pw and S_aw is retrieved from WordNet based on the SemEval sense annotations.

Table 2 :
Example annotations from the CUP dataset. Labels L indicate whether the annotator was able to write a pun given the context and pun pair. 7

7 Further experimental details are in Appendix A.

Table 5 :
TP@N results for supervised (neural) and unsupervised approaches for pun word retrieval. TP stands for True Positive rate.

Table 7: Example generations. For each example, we show a generated pun with its context words and pun word, followed by AmbiPun's output for the same input.
"A scientist who is a liquid chemical expert can't assay the problem." (context: chemicals, problem; pun word: say) AmbiPun: "What do you call a scientist with a liquid chemicals problem? an assay-ist."
"She was only a Fruit Vendor's daughter, but she yammered." (context: daughter; pun word: yam) AmbiPun: "My daughter yammered at the fruit vender... she said i'm not a fruit vender."
He et al. (2019) make use of the local-global surprisal principle to generate heterographic puns, and Yu et al. (2020) use constrained lexical rewriting for the same task. Hashimoto et al. (2018) use a retrieve-and-edit approach to generate homographic puns, and Yu et al. (2018) and Luo et al. (2019) propose complex neural model architectures such as constrained language models and GANs. Mittal et al. (2022) generate homographic puns given a polyseme and try to incorporate the multiple senses of the polyseme. Tian et al. (2022) propose a unified framework to generate both homographic and homophonic puns. Our setting differs from all previous work in first asking what pun words we should use for generating a pun in a given context. Meanwhile, our work draws a connection between heterographic and homographic pun generation: both types must utilize the two meanings of a pair of words.