A Unified Framework for Pun Generation with Humor Principles

We propose a unified framework to generate both homophonic and homographic puns, resolving the split between existing lines of work. Specifically, we incorporate three linguistic attributes of puns into language models: ambiguity, distinctiveness, and surprise. Our framework consists of three parts: 1) a context word/phrase selector to promote the aforementioned attributes, 2) a generation model trained on non-pun sentences to incorporate the context words/phrases into the generation output, and 3) a label predictor that learns the structure of puns and is used to steer the generation model at inference time. Evaluation results on both pun types demonstrate the efficacy of our model over strong baselines.


Introduction
Recently, computational humor theories investigating why puns are funny have shown high correlations with human judgments. Kao et al. (2016) use a probabilistic model to decompose puns into two dimensions, ambiguity of meaning and distinctiveness of viewpoints, and show that these two aspects combined have the strongest alignment with human judgments (p < 0.005). He et al. (2019) show that ambiguity and distinctiveness alone cannot capture the whole picture, and develop an additional metric to measure how much surprise is aroused when the pun word and alternative word are flipped. For example, in Figure 1 the pun word is 'soled' and the alternative word is 'sold'. Seeing 'soled' in the phrase 'were soled at the store at half price' instead of 'sold' arouses surprise in the local context but makes sense in the global context.
Despite the success in identifying important linguistic traits of successful puns, how to incorporate these theories into the pun generation process is still an open problem.
Although He et al. (2019) propose a retrieve-and-edit approach to incorporate surprise, their error analysis shows that the proposed retrieval methods are often unsuccessful. Moreover, existing works on pun generation are split between generating homographic puns, wherein the same written word has two or more meanings (Mittal et al., 2022; Yu et al., 2020, 2018), and homophonic puns, where two words that sound similar have different meanings (Luo et al., 2019; He et al., 2019; Hashimoto et al., 2018). There is no unified generation framework for both types of puns.
In this work, we incorporate all three principles (ambiguity, distinctiveness, and surprise) into pun generation, and bridge the gap between the two pun types. We hypothesize that there is a learnable structure for puns regardless of the pun type, and propose a unified framework by converting homographic puns to homophonic ones. Specifically, we carefully extract from a non-pun corpus 1) a context word that supports the meaning of the pun word, and 2) a phrase that is both characteristic of the alternative word and compatible with the pun word. Next, we train a discriminator on existing homophonic pun data to learn the structure of a pun, i.e., the type of each word in the sentence, which is one of 'A' (ambiguous), 'D1' (distinct to the pun word), or 'D2' (distinct to the alternative word). One challenge, however, is that there are no ground-truth labels. To this end, we collect a small amount of human annotations and bootstrap from weak, unsupervised models to stronger, supervised models. At inference time, a label predictor is used to guide a base GPT-2 model to generate puns. At each generation step, we re-score the tokens generated by the base language model according to the predicted type, except when the label predictor's confidence is under a set threshold. Our model outperforms existing baselines for both pun types.

Related Works
Linguistic traits of puns. Kao et al. (2016) decompose puns into two dimensions, ambiguity of meaning and distinctiveness of viewpoints, and show that ambiguity is useful in distinguishing non-puns from puns, while distinctiveness is useful for separating good, funny puns from bad or boring ones. To the best of our knowledge, we are the first to formally incorporate this well-known ambiguity-distinctiveness principle to guide pun generation. In addition, He et al. (2019) propose the local-global surprisal principle to measure the humorous effect aroused when a word appears unexpectedly in the local context but makes sense given the global context, based on which we improve the way surprise is introduced in generation.
Pun generation. Existing works on pun generation often rely on naive intuitions of semantic ambivalence. For example, Yu et al. (2018) and Luo et al. (2019) promote the ambivalence of the pun word via a constrained language model and reinforcement learning, while others find related words to support semantic ambiguity (Yu et al., 2020; Mittal et al., 2022). However, these systems lack a serious theoretical backbone, and therefore none could evaluate their generated results with regard to the proposed intuitions. What is more, relying on 'ambivalence' alone leads to generic or boring word choices and short outputs. By incorporating distinctiveness and surprise, we ensure that the generated puns are informative and interesting.
One reason that previous works leverage those simple intuitions to generate puns (He et al., 2019; Yu and Wan, 2019; Yu et al., 2020) is that the small size of available pun corpora (Miller et al., 2017; Sun et al., 2022a) makes it impractical to train generation models end-to-end on human-written puns. We hence propose to learn the structure of puns instead of the actual texts, which requires far less data to train on. Finally, all previous works (except a concurrent one (Sun et al., 2022b)) can only generate either homographic puns or homophonic puns. Leveraging the shared structure of puns regardless of the pun type, our model can generate both pun types.

Methodology
The input to our system is a pun word-alternative word pair (pw-aw, e.g., soled-sold), and the target output is a high-quality pun sentence that contains pw, e.g., 'The leather boots he was wearing were heavily abraded, and were soled at the store at half price.' In this section, we first describe the three components used to generate homophonic puns, as shown in Figure 1: a context word and phrase selector, a label predictor together with the procedure for curating its training signals, and the generation module (Sections 3.1 to 3.3). Then, we migrate the whole system to homographic puns in Section 3.4.

Obtaining Context Words and Phrases
We retrieve and select two things: a context word that supports the meaning of the pun word, and a phrase that is both characteristic of the alternative word and compatible with the pun word.
Inspired by He et al. (2019), given a pun-alternative word pair (pw-aw), we look for an ideal phrase that contains aw and replace it with pw to arouse surprise. To this end, we first extract multiple (N1 = 20) phrases that contain aw from a large non-pun corpus consisting of 20,000 sentences from Wikipedia and the Gutenberg BookCorpus (Lebert, 2009), and rank the phrases by how well they exhibit the semantics of the pun pair. Specifically, we first replace aw with a '<mask>' token and run RoBERTa-Large (Liu et al., 2019) to obtain the probability of aw in the masked position. We remove the less probable half, filtering out phrases that are less characteristic of aw, as shown in Table 1. Next, we conduct a similar mask-infilling procedure for pw and select the middle-ranking phrase to avoid it being either too general (e.g., 'a new taxi was created') or too incompatible (e.g., 'an export taxi on agricultural products'). These two rankings ensure the final selected phrase arouses surprise when people see pw instead of aw, while still being plausible.
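The two-stage ranking above can be sketched as follows. Here `aw_scores` and `pw_scores` stand in for the RoBERTa mask-infilling probabilities of the alternative and pun word in each masked phrase; the function name and interface are our own illustration, not the authors' code.

```python
def select_phrase(phrases, aw_scores, pw_scores):
    """Two-stage phrase selection (sketch).

    phrases   : candidate phrases containing the alternative word (aw)
    aw_scores : P(aw | phrase with aw masked), e.g., from RoBERTa-Large
    pw_scores : P(pw | phrase with aw masked), same masked positions
    """
    # Stage 1: keep the half of phrases most characteristic of aw.
    ranked = sorted(zip(phrases, aw_scores, pw_scores),
                    key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, len(ranked) // 2)]
    # Stage 2: rank survivors by pw compatibility and take the middle one,
    # avoiding phrases that are too general (top) or too incompatible (bottom).
    by_pw = sorted(kept, key=lambda t: t[2], reverse=True)
    return by_pw[len(by_pw) // 2][0]
```

With real scores, stage 1 filters out phrases where aw is implausible, and the middle-rank pick in stage 2 trades off surprise against plausibility.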
To obtain the context words, our idea is similar to that of Mittal et al. (2022). We retrieve sentences containing the target word (pw) from the same non-pun corpus, and then extract keywords from those sentences using RAKE (Rose et al., 2010). Based on the TF-IDF values of those keywords, we take the top N2 (= 20) words that uniquely co-occur with the target word, and then randomly sample one to encourage creativity.
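A minimal sketch of this selection step, assuming keyword counts have already been extracted (e.g., by RAKE) and a background document-frequency table is available; the names and the exact TF-IDF formula are our illustration, not the paper's implementation:

```python
import math
import random

def context_word(keywords, doc_freq, n_docs, top_n=20, seed=0):
    """Pick a context word for the pun word (sketch).

    keywords : (keyword, count) pairs extracted from sentences containing pw
    doc_freq : background document frequency per keyword
    n_docs   : number of documents in the background corpus
    """
    def tf_idf(word, tf):
        # Smoothed IDF; words unique to pw's sentences get high scores.
        return tf * math.log(n_docs / (1 + doc_freq.get(word, 0)))

    top = sorted(keywords, key=lambda kv: tf_idf(*kv), reverse=True)[:top_n]
    # Random sampling among the top candidates encourages creativity.
    return random.Random(seed).choice([w for w, _ in top])
```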

Label Predictor
Our label predictor aims at learning and predicting the word-level structure of the 1,500 human-written pun sentences collected by Miller et al. (2017). Each word in a pun sentence falls into one of three types: 'A' for ambiguous, 'D1' for distinct to the pun word, and 'D2' for distinct to the alternative word. We finetune a BERT (Devlin et al., 2018) sequence classification model to predict the next token's type. In this section, we first describe a data-efficient label collection procedure that bootstraps from a weak, unsupervised method to a stronger, weakly supervised method.
Ground Truth Label Curation Before we can train the model, how do we automatically categorize each word in a pun sentence? We start with an unsupervised approach based on word semantic similarity. Specifically, we compute the cosine similarity between the GloVe embeddings of pw and aw and each word in the sentence (tw), and label the word as D1/D2 if the difference is larger than a threshold T (i.e., |cos(tw, pw) − cos(tw, aw)| > T); otherwise, we label the word as A. To measure how well these labels correlate with human judgements, we collected human annotations on 500 data points. Since the label predictor should predict the type of the next word without knowing the future, we mimic this setting and show human annotators 1) an incomplete pun sentence, i.e., the part of the sentence before the current word tw being evaluated, 2) tw, and 3) pw and aw, and ask them to decide whether tw is distinct to pw, distinct to aw, or ambiguous. With grid search, we find that with the optimal T set to 0.15, this purely unsupervised method achieves 72.9% labeling accuracy. To further improve the reliability of the 'ground truth' labels, we finetune a BERT-base model as a sequence classifier to classify each word into the three categories. The intuition is to provide this BERT classifier with less noisy data so that it can learn the task better than the unsupervised approach. The training data of this BERT classifier includes 8,000 automatic labels obtained with the unsupervised method: a word is considered distinct (D1/D2) if the difference is > 1.5T, and ambiguous if the difference is < T. To compose a dataset with cleaner labels, we simply disregard those training samples whose semantic difference falls in [T, 1.5T]. We include the incomplete pun sentence, the current word, and the pun pair as input. With this, we improve the 'gold' label accuracy to 84.6% on a human-annotated held-out dataset. We call this model Bert_c.
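The cosine-gap labeling rules above can be written down directly. This is a sketch over plain embedding vectors (in the paper these would be GloVe vectors); `word_label` is the unsupervised labeler with threshold T, and `clean_label` is the stricter variant used to build the BERT classifier's training set.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def word_label(tw_vec, pw_vec, aw_vec, T=0.15):
    """Unsupervised label: D1/D2 if the similarity gap exceeds T, else A."""
    d1 = cosine(tw_vec, pw_vec)
    d2 = cosine(tw_vec, aw_vec)
    if abs(d1 - d2) > T:
        return "D1" if d1 > d2 else "D2"
    return "A"

def clean_label(tw_vec, pw_vec, aw_vec, T=0.15):
    """Stricter labels for training Bert_c: distinct only above 1.5T,
    ambiguous only below T; the grey zone [T, 1.5T] is discarded."""
    d1 = cosine(tw_vec, pw_vec)
    d2 = cosine(tw_vec, aw_vec)
    gap = abs(d1 - d2)
    if gap > 1.5 * T:
        return "D1" if d1 > d2 else "D2"
    if gap < T:
        return "A"
    return None  # grey zone: sample is dropped from the training set
```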
Training The label predictor used in our framework predicts the type of the next word that is yet to be generated. We call this model Bert_n. Its training data comes from two parts: an additional 430 human-labeled examples on which the unsupervised method and Bert_c disagree, and 8,000 automatic labels on which both models agree. A breakdown of its performance by category is reported in Table 2. We can further improve the predictor's F1 score by considering its confidence: we gain an average increase of 14.9% by discarding only the 9.8% of cases on which the label predictor is less confident. Since predicting the type of the next word is naturally difficult (there can be multiple valid choices for the next word, so the best achievable performance is bounded), we take this confidence level into account when using the predictor in the next step.

Generation Module
Data Preparation and Fine-tuning We finetune the GPT-2 model on a combination of the Gutenberg BookCorpus and jokes (Annamoradnejad and Zoghi, 2020) to learn the task format: given a keyword and a phrase as input, generate a sentence containing them both. For each sentence in the corpus, we use RAKE to extract two salient words. We then include the surrounding context around one of the salient words as the phrase. We make sure that the positions of the two extracted keywords are far enough apart that the phrase does not contain the other extracted word.
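The pair construction can be sketched as below. The window size and the dictionary field names are our own illustration; the only constraint from the paper is that the phrase around one keyword must not contain the other.

```python
def make_example(sentence, kw1, kw2, window=3):
    """Build one (keyword, phrase) -> sentence fine-tuning pair (sketch).

    The phrase is a +/- `window`-word span around kw2; the pair is kept
    only if kw1 falls outside that span, so the model must place both
    the keyword and the phrase itself.
    """
    words = sentence.split()
    i1, i2 = words.index(kw1), words.index(kw2)
    lo, hi = max(0, i2 - window), min(len(words), i2 + window + 1)
    if lo <= i1 < hi:
        return None  # keywords too close; skip this sentence
    phrase = " ".join(words[lo:hi])
    return {"keyword": kw1, "phrase": phrase, "target": sentence}
```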
Inference At inference time, we feed the phrase and context word obtained in Section 3.1 as input, and steer the finetuned language model using the label predictor. At each step, we obtain the predicted type of the next token, L, with the corresponding confidence c. If c is larger than a confidence threshold, we score and re-rank the candidate next words according to the predicted label, as detailed in Algorithm 1. Otherwise, when our label predictor is less confident, we do not intervene in the language model's generation. We also enforce the appearance of the complete phrase during decoding once the first two words of the phrase have been generated.
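One decoding step of this label-guided scheme can be sketched as follows. The additive bonus `alpha`, the threshold value, and the `token_label` scorer (a stand-in for whatever maps a candidate token to 'A'/'D1'/'D2', e.g., the cosine-gap heuristic) are our assumptions, not the paper's exact Algorithm 1.

```python
def rerank_step(candidates, label_probs, token_label,
                conf_threshold=0.6, alpha=1.0):
    """Pick the next token under label guidance (sketch).

    candidates  : (token, lm_logprob) pairs from the base LM's top-k
    label_probs : predictor's distribution over {'A', 'D1', 'D2'}
    token_label : maps a candidate token to its own 'A'/'D1'/'D2' type
    """
    label, conf = max(label_probs.items(), key=lambda kv: kv[1])
    if conf < conf_threshold:
        # Predictor unsure: leave the LM's ranking untouched.
        return max(candidates, key=lambda c: c[1])[0]
    # Boost candidates whose type matches the predicted label.
    def score(token, logprob):
        return logprob + (alpha if token_label(token) == label else 0.0)
    return max(candidates, key=lambda c: score(*c))[0]
```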

Migrate to Homographic Puns
We convert the task of generating homographic puns to generating homophonic puns by leveraging a word-sense disambiguation (WSD) model (Bevilacqua and Navigli, 2020). For example, if the target pun word is "sentence" and the two sense definitions are "a set of words..." and "the punishment...", we run the WSD model to identify which extracted phrases exhibit the second sense. Next, we obtain two new words using a reverse dictionary (Qi et al., 2020): 'clause' for the first sense and 'conviction' for the second. The task can then be viewed as homophonic pun generation, where the substitute pw is 'clause' and the substitute aw is 'conviction'. The rest of the generation process is the same as in Sections 3.2 and 3.3.
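The conversion amounts to the following glue logic. Here `wsd` and `reverse_dict` are hypothetical wrappers around a WSD model (e.g., Bevilacqua and Navigli, 2020) and a reverse dictionary (e.g., Qi et al., 2020); their interfaces are our illustration.

```python
def to_homophonic(pun_word, sense_defs, phrases, wsd, reverse_dict):
    """Reduce a homographic pun task to the homophonic setting (sketch).

    sense_defs   : the two sense definitions of the homographic pun word
    wsd(p, w)    : index (0 or 1) of the sense of word w exhibited by phrase p
    reverse_dict : maps a sense definition to a substitute word
    """
    sub_pw = reverse_dict(sense_defs[0])  # e.g., 'clause'
    sub_aw = reverse_dict(sense_defs[1])  # e.g., 'conviction'
    # Keep only phrases exhibiting the second (alternative-word) sense.
    aw_phrases = [p for p in phrases if wsd(p, pun_word) == 1]
    return sub_pw, sub_aw, aw_phrases
```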

Compared Models
We compare with the two best existing models for each pun type. Homophonic: SurGen (He et al., 2019), a retrieve-and-edit model using the local-global surprisal principle, and LCR (Yu et al., 2020), the SOTA model that first finds appropriate lexical constraints and then rewrites the sentence. Homographic: Pun-GAN (Luo et al., 2019), a model that adopts GANs to encourage ambiguity, and AmbiPun (Mittal et al., 2022), the SOTA model that generates puns by including context words from both senses. We also compare ablations of our own model: the base GPT-2 model given a random word and phrase, the base model with either the label predictor or the selected context word and phrase (which we call '+ select') added, and the best model that includes both the label predictor and the selection.

Automatic Evaluation
For each system, we compute the ambiguity, distinctiveness, and surprisal ratio (Kao et al., 2016; He et al., 2019), and report the results in Table 3. For both pun types, our model surpasses the best baseline by a large margin in terms of distinctiveness, meaning that our model supports distinct viewpoints in the sentence. Notably, our surprisal ratio surpasses that of human-written puns. Moreover, He et al. (2019) have shown that while higher D and S scores usually indicate higher quality, this is not the case for ambiguity. Intuitively, since many ambiguous sentences are not informative (e.g., 'I went to the bank'), ambiguity alone is insufficient. Our results corroborate the finding that A is useful in distinguishing non-puns from puns, while D and S are useful for separating good, funny puns from bad or boring ones. Besides, our statistics show that humans tend to provide more context for the pun word when writing homophonic puns: 24% of the words are distinct to the pun word, versus 14% for the alternative word. This partially explains the imbalance between D1 and D2 in human-written puns. Our label predictor learns this distribution and steers the base GPT-2 model more towards the pun word.
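One plausible formalization of the surprisal ratio, assuming access to the conditional probabilities of pw and aw under a narrow local context and the full global context; the exact formulation in He et al. (2019) may differ, so this is an illustrative sketch only.

```python
import math

def surprisal_ratio(p_local, p_global):
    """Local-global surprisal ratio in the spirit of He et al. (2019).

    p_local  : (P(pw | local context), P(aw | local context))
    p_global : (P(pw | global context), P(aw | global context))

    Surprisal of seeing pw where aw is expected is -log(P(pw)/P(aw)).
    A good pun is surprising locally but plausible globally, so the
    ratio of local to global surprisal should exceed 1.
    """
    def surprisal(p_pw, p_aw):
        return -math.log(p_pw / p_aw)
    return surprisal(*p_local) / surprisal(*p_global)
```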

Human Evaluation
We ask qualified workers to judge if the given sentence is a successful pun, and to rate its informativeness (or specificity) and funniness on a scale from 1 to 5. The evaluation details can be found in Appendix B. Results in Table 4 show that our model achieves the highest success rate and is the most informative and funny among all machine systems. We also observe that the improvements on homographic puns are smaller than those on homophonic puns. Upon error analysis, we find that half of the failure cases of homographic puns are due to inappropriate substitute pun/alternative words. Instead of using the sense keys provided in WordNet (Miller, 1995), if the user can manually provide the sense definitions to ensure the substitute pun pair is reasonable, such a bottleneck shall be resolved.

Example outputs (Table 5):

Pun pair: mane-main
- LCR: The mane object of the hair was accomplished.
- SurGen: A trot later, he was sitting away from the mane dining area.
- Ours: In some places, hair also makes up the mane entrance to fashion salons.
- Human: Lions don't have to worry about every little detail in life, just the mane thing.

Pun pair: sentence ⇒ clause-punishment
- Pun-GAN: Due to the sentence it is in the United States.
- AmbiPun: The sentence is ungrammatical. The jury didn't hear it.
- Ours: The language on a two-page sentence for fraud is full of guilt.
- Human: The judge has got a stutter. Looks like I am not getting a sentence.

Ablation and Case Study
To validate the effectiveness of each proposed module, we report their performance in Tables 3 and 4, with a bar chart in Appendix B for a more direct illustration. Both the label predictor and the word/phrase selection process contribute positively to the outputs, and the model works best when the two are combined.
A comparison between our model and the baselines is shown in Table 5. Although existing approaches also include related words for semantic relevance, their outputs tend to be too vague (e.g., LCR and Pun-GAN) or abrupt (e.g., 'a trot later' by SurGen). We also showcase the outputs for two more pun pairs, along with the predicted token types, in Appendix C. Both results demonstrate that our model is best at generating informative puns with humorous effects.

Conclusion
We propose a novel pun generation approach that incorporates three humor principles. To this end, we learn sentence structures from human-written puns, and convert the task of homographic pun generation to homophonic pun generation. Our model achieves strong performance for both pun types.

Limitations
We discuss several limitations of this work to inspire future research directions. First, our method relies on a small amount of human-written puns as a training corpus, and thus might not work well for low-resource languages. Second, as can be seen in Table 2, the overall performance of the label predictor is not perfect. While we argue that predicting the type of the next word is naturally difficult, as there can be multiple good candidates, the errors of the label predictor may propagate and lead to unnatural outputs. Third, our system cannot generate homographic puns as successfully as homophonic ones. Human evaluation and further error analysis show that the main reason for failure is that the generated substitute pun-alternative word pair is poor. Given a homographic pun word, we currently retrieve its two sense definitions from WordNet (Miller, 1995) using the sense keys provided in the SemEval 2017 dataset (Miller et al., 2017), and the retrieved sense definitions are sometimes imprecise. Future directions include refining the procedure for finding substitute pun-alternative word pairs, and curating a more accurate definition dataset for homographic pun words.
Our proposed method is independent of the specific language model being used. The selection process is purely unsupervised, and our label predictor can in principle be combined with any language model as long as we can obtain the top k tokens it produces. Another future direction is applying our technique to steer GPT-3 (Brown et al., 2020) or GPT-J to generate humorous puns.

Figure 1: An illustration of our approach. The pun word-alternative word pair (e.g., 'soled'-'sold') is the input. After retrieving a suitable context word and a phrase, we use a pun label predictor to steer the base GPT-2 model to produce puns. Labels D1/D2/A mean the next word should be distinct to (supporting) the pun word, distinct to (supporting) the alternative word, or ambiguous. A '-' mark means the label predictor is less confident and thus we do not intervene in the generation process.

Table 1: An example of the retrieved phrases which are

Table 2: The F1 scores of Bert_n and its ablations on the human-annotated test set.

Table 3: Results of automatic evaluation on ambiguity (A), distinctiveness to the pun word (D1) and alternative word (D2), and surprisal ratio (S). The * indicates ablations of our method; paired t-tests show that the difference between our best-performing model and the best baseline is statistically significant (p < 0.05). Boldface denotes the best score and underline denotes the second best.

Table 4: Results of human evaluation on pun success rate, informativeness, and funniness. The * indicates ablations of our method. Boldface denotes the best score and underline denotes the second best.

Table 5: Example outputs of different models. The pun pairs are randomly selected.