Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End

We consider the end-to-end abstract-to-title generation problem, exploring seven recent transformer-based models (including ChatGPT) fine-tuned on more than 30k abstract-title pairs from NLP and machine learning (ML) venues. As an extension, we also consider the harder problem of generating humorous paper titles. For the latter, we compile the first large-scale humor-annotated dataset for scientific papers in the NLP/ML domains, comprising 2.6k titles. We evaluate all models using human and automatic metrics. Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse). Generating funny titles is more difficult, however, and our automatic systems clearly underperform relative to humans and often learn dataset artefacts of humor. Finally, ChatGPT, without any fine-tuning, performs on the level of our best fine-tuned system.


Introduction
Computer-assisted writing is an important and longstanding use case of natural language processing (NLP) and natural language generation (NLG). A popular early scenario involved machine translation, where the success of an MT system was measured in the number of human edits required to transform system output into an adequate translation (Krings, 2001); the HTER metric, which measures this, has remained an important metric until recently (Specia and Farzindar, 2010; Specia et al., 2021). The recent success of large-scale language models (LLMs), such as the GPT generation of NLG models, has made the goal even more realistic and promises full-scale automatic text generation, without any human intervention.
In this work, we concern ourselves with automatic text generation in the scientific domain. Sample scenarios in this general context involve (semi-)automatically generating reviews for scientific papers (Yuan et al., 2022), e.g., as a response to high reviewing load in the face of exploding submission numbers; and generating captions for tables that require reasoning capabilities (Moosavi et al., 2021). Our goal is much more modest: we ask whether language models can generate adequate titles given a human-authored abstract as input; we refer to this task as A2T (abstract-to-title generation). Title generation is important as titles are the first access point to a paper; a good title may thus attract more readers and consequently increase a paper's impact, e.g., in terms of citation numbers (Falagas et al., 2013).
We approach the problem as a standard sequence-to-sequence text generation problem, where we fine-tune pre-trained language models on more than 30k abstract-title pairs from the ML and NLP communities. Besides generating titles per se, we also aim at generating humorous titles, an inherently difficult problem due to small sample size: in our corpus, less than 5% of human titles are estimated to be funny. Generating funny titles may be relevant, as a funny title may attract more readers and thus garner (even) more citations (Heard et al., 2022).
Our contributions: • To our knowledge, we provide the first publicly available humor-annotated dataset for scientific titles in the NLP and ML domains, encompassing 2,441 titles annotated for humor by 2 annotators with decent levels of agreement (kappa ∼0.65).
• We explore 6 recently popular text generation systems on the A2T task, finding one of them to be competitive with human titles, according to automatic and human evaluation involving 15 annotators.
• We analyze the problem and find that the A2T task is to some degree ill-posed as a good title may leverage more than the abstract alone (we argue that the problem framing is still a legitimate and efficient approximation).
• For humor generation, we find that our models clearly underperform relative to humans and instead often learn dataset artefacts.
• We finally analyze ChatGPT on a small scale and find that it may be competitive with (albeit slightly weaker than) our best fine-tuned model without any task-specific fine-tuning at all.

Related Work
Title generation and evaluation Mishra et al. (2021) perform A2T with pre-trained GPT-2 fine-tuned on arXiv papers and subsequent (rule-based) modules of title selection and refinement. We compare many more text generation models for the task, use better evaluation (excluding low-quality outdated surface-level metrics such as BLEU; including more comprehensive human evaluation), do not make use of rule-based selection, and also consider humor in title generation. Putra and Khodra (2017) classify sentences from paper abstracts into rhetorical categories, retain those relating to methods and results, and then generate titles using templates. They further note the relationship between the task of summarization (Nenkova et al., 2011) and A2T, as a title can be seen as a summary of the research paper. We also leverage the relationship to summarization by considering pre-trained models fine-tuned on summarization datasets. In contrast to Putra and Khodra (2017) and Mishra et al. (2021), we only consider end-to-end models that do not involve error-prone pipelines.
Beyond title generation, related fields of text generation for science are related work generation (Li et al., 2022) and, outside of science, headline generation, e.g., for news (Tan et al., 2017). Tan et al. (2017) use a coarse-to-fine approach which first identifies important sentences and then converts them into a headline. In this way, the model is not confused by 'too much' irrelevant information. In A2T, this first summarization step is apparently not necessary, as the abstract is already a summary of the scientific paper. Yuan et al. (2022) explore automatically generating reviews for scientific articles.
How titles should be (and are) structured has been researched for a long time, e.g., Hartley (2005); Lewison and Hartley (2005). Hartley (2008) gives a categorization of title types, distinguishing 13 title classes, e.g., those that state results vs. methods.
Humor identification and generation Humor detection is a niche area in NLP, but nonetheless one with a rich history. For example, Mihalcea and Strapparava (2006) distinguish funny from non-funny sentences (heuristically scraped from the Web) using features and traditional classifiers. Simpson et al. (2019) focus on efficiently annotating humor and inducing classifiers from crowdsourced data. Recently, Peyrard et al. (2021) show that transformers are strong at distinguishing funny from non-funny sentences on minimal pairs of satirical news headlines. In the scientific domain, Heard et al. (2022) annotate a dataset of more than 2k titles from ecology using a fine-grained Likert scale. The majority were labeled as non-funny and annotators exhibited low agreement.
There is considerably less work on humor generation. As one exception, He et al. (2019) generate puns by a retrieve-and-edit approach based on word2vec, thus circumventing the problem of little training data for puns.

Data
We use the dataset released by Beese et al. (2022), which contains title-abstract pairs and corresponding meta-information such as the publication year and venue. Beese et al. (2022) extracted the data from two sources: the ACL Anthology (from 1984 to 2021) and machine learning conferences (from 1989 to 2021); we refer to the datasets from these two sources as NLP and ML, respectively, following the original paper.
Filtering (1) We restrict the data to main conference papers (e.g., EMNLP, ACL) to ensure data quality. (2) As Figure 1 shows, most abstracts have fewer than 400 words; by introspection, we find that extremely long abstracts often contain extra sections besides the abstract, due to limitations of Allen AI's Science Parse, which Beese et al. (2022) used to automatically extract the paper structure. As a consequence, we limit the data to abstracts shorter than 400 words. (3) In addition, we only leverage papers published after the year 2000 (which form the majority anyway). After filtering, 32,952 abstract-title pairs remain in our dataset.

Humor Annotation + Classification To generate humorous titles from abstracts, in addition to the abstract-title pairs, we need labels regarding the humorousness of the titles. Thus, we train humor classifiers to automatically label the titles as FUNNY, FUNNY medium, and ¬FUNNY (denoted as 2, 1 and 0, respectively; see Table 1 for examples). Two co-authors participated in the annotation. We provided no guidelines and asked the annotators to rely on their intuitive notion of humor. To measure annotation quality, we calculate Cohen's Kappa agreement (Cohen, 1960) between them several times during the whole procedure.
Stage 1: The two annotators initially annotated 1,730 titles, obtaining 0.650 Kappa agreement on the 300 commonly annotated titles; 1,603 titles were labeled as ¬FUNNY, 106 as FUNNY medium, and 21 as FUNNY. Since funny titles (FUNNY or FUNNY medium) make up only ∼7.3% of the annotated data, we randomly generate 11 different data splits, where the train set of each split consists of 100 funny or medium funny titles and 200 not funny titles (all randomly drawn), while the remaining 27 funny titles and 27 not funny titles compose the dev set. From those data splits, we train 11 classifiers to construct an ensemble classifier (checkpoints selected based on the macro F1 on each dev set). To evaluate classifier performance, the two annotators annotated another 315 titles jointly, obtaining 0.639 Kappa agreement. Our best ensemble classifier sums the label values assigned by the 11 individual classifiers to predict humorousness, yielding a 4.8-point improvement in macro F1 over the individual classifiers (62.4% vs. 57.6%). We relegate details to Appendix B.
Stage 2: To find more funny title candidates to annotate, the two annotators annotated the 396 titles from the original dataset of Beese et al. (2022) predicted to be funniest by the ensemble classifier developed in Stage 1; 75.8% (300 titles) were labeled as FUNNY or FUNNY medium, which is substantially higher than the proportion of funny titles in the annotated data of Stage 1 (7.3%), bringing us 300 more funny titles. Thus, the annotated data expands to 2,441 titles (1,730 + 315 + 396 = 2,441), where 1,893 are labeled as ¬FUNNY, 492 as FUNNY medium and 56 as FUNNY. Subsequently, we re-train 11 classifiers on 11 newly generated data splits from the expanded data of 2,441 titles; now the train set of each split is composed of 400 funny or medium funny titles and 800 not funny titles. As before, the 11 classifiers are then used to construct an ensemble classifier by means of the optimal ensemble strategy identified in Stage 1.
We test the classifiers from both stages on a held-out test set containing 197 titles annotated by the two annotators, who obtain 0.649 Kappa on this evaluation data. The macro F1 scores of those classifiers are presented in Table 3. As FUNNY titles are rare in the whole dataset, we also evaluate the classifiers on the corresponding binary classification task, where FUNNY and FUNNY medium are merged into one category. We observe that: (1) the ensemble classifiers perform better than the individual ones; (2) the classifiers from Stage 2 are superior to the ones from Stage 1, indicating that a larger training set is beneficial; (3) the best three-way classifier achieves only ∼58% macro F1, whereas it reaches ∼88% macro F1 on the binary classification, implying that the three-way classification is hard, but the classifiers can handle the binary classification task. Besides, we see a consistent improvement of human annotation quality: the two annotators achieve 0.01-0.1 higher Kappa agreement when their annotations are down-scaled to binary ones (see Table 14 in Appendix B). As a consequence, we use the ensemble classifier from Stage 2 as the humor classifier in further experiments.

Final Dataset To obtain the final dataset, we use our humor classifier to automatically label the rest of the (not yet annotated) data. Considering the difficulty of the three-way classification task for both human annotators and automatic classifiers, we only consider two humor levels in the further generation experiments: (1) FUNNY (for funny and medium funny titles) and (2) ¬FUNNY (for not funny titles). In total, we collect 31,541 instances (>95%) with ¬FUNNY and 1,411 with FUNNY titles, where each instance now consists of the abstract-title pair and the humor label for the title, as illustrated in Table 2. Subsequently, we split the resulting data into train, dev, and test sets, ensuring that (1) the data with human-annotated titles remains in the train set, as the humor classifier trained and evaluated on it will be used as an automatic humor evaluator; (2) 80% of the data in dev/test is from NLP and 20% from ML, because our annotators are more knowledgeable about NLP papers; and (3) the ratio of FUNNY to ¬FUNNY data is 1:2. As FUNNY data makes up only a small portion of the whole dataset, we only keep 600 instances in the dev/test sets, whereas the remaining data serves as training data. In Table 4, we summarize the statistics of the final dataset.
Examples of human-annotated funny instances are in the appendix. We note that FUNNY medium titles often contain acronyms for datasets or models, while FUNNY titles typically contain a 'genuine' humor component.

Title Generation
In the first phase of the experiments, we explore whether existing state-of-the-art Seq2Seq models manage to generate human-level titles from abstracts. Hence, we do not include humor constraints in the inputs of the generation systems.

Fine-tuning
For all baseline models, we continue fine-tuning them on the abstract-title pairs from our dataset with the AdamW optimizer (Loshchilov and Hutter, 2019) and a linear learning rate scheduler, and subsequently use beam search (Vijayakumar et al., 2016) as the decoding strategy to generate the output candidates. The optimal checkpoint for each model is selected based on the ROUGE-1/2/L (Lin, 2004) scores on the dev set. We relegate the (hyper)parameter settings for training and beam search to Appendix C.
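To make the setup concrete, below is a minimal sketch of such a fine-tuning and decoding pipeline with Hugging Face transformers; the checkpoint name (facebook/bart-large-xsum), sequence lengths, and hyperparameter values are illustrative assumptions, not the exact settings reported in Appendix C.

```python
# Sketch: fine-tune a pre-trained summarization checkpoint on abstract-title
# pairs with AdamW + linear LR decay, then decode titles with beam search.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = "facebook/bart-large-xsum"   # assumed checkpoint for BART-xsum
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # abstracts are the source sequence, titles the target sequence
    enc = tokenizer(batch["abstract"], max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["title"],
                              max_length=64, truncation=True)["input_ids"]
    return enc

args = Seq2SeqTrainingArguments(
    output_dir="a2t-bart-xsum",
    learning_rate=4e-5,            # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    predict_with_generate=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_ds.map(preprocess),
#                          eval_dataset=dev_ds.map(preprocess),
#                          tokenizer=tokenizer)
# trainer.train()   # checkpoint selection via ROUGE on the dev set

# Decoding: beam search over the abstract to obtain title candidates.
abstract = "We consider the end-to-end abstract-to-title generation problem ..."
ids = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512).input_ids
out = model.generate(ids, num_beams=5, num_return_sequences=3, max_length=32)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```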

Evaluation
We assess the performance of the generation systems on 230 abstracts using both automatic evaluation metrics and human evaluation. Besides the six automatic generation systems, we also include the human-generated titles in the evaluation; the respective system is denoted as 'HUMAN'.

Automatic Evaluation
As there are no A2T task-specific evaluation metrics, we use the following existing evaluation metrics designed for other NLG tasks such as machine translation or summarization: BERTScore (Zhang et al., 2020), MoverScore (Zhao et al., 2019), COMET (Rei et al., 2020), BARTScore (Yuan et al., 2021), and MENLI (Chen and Eger, 2022). In the evaluation, we employ all considered metrics in both reference-based and reference-free settings. In the reference-based setup, the metrics compare the system titles with the original (human-generated) titles, while in the reference-free evaluation, the system titles are directly compared to the abstracts. Further, we examine whether those evaluation metrics are reliable for evaluating the titles' quality, based on the human evaluation results.
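For illustration, the snippet below applies one of these metrics (BERTScore) in both setups; the other metrics are used analogously, and the example strings are toy inputs rather than actual data.

```python
# Reference-based vs. reference-free scoring with BERTScore (bert-score package).
from bert_score import score

system_titles = ["Generating Paper Titles from Abstracts with Transformers"]
human_titles  = ["Transformers Go for the LOLs: Generating Titles from Abstracts"]
abstracts     = ["We consider the end-to-end abstract-to-title generation problem ..."]

# Reference-based: compare the system title against the human-written title.
_, _, f_ref = score(system_titles, human_titles, lang="en")
# Reference-free: compare the system title directly against the abstract.
_, _, f_src = score(system_titles, abstracts, lang="en")

print(float(f_ref[0]), float(f_src[0]))
```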

Human Evaluation
The human evaluation is conducted in the reference-free setup: 15 annotators were asked to select the two best and two worst titles among six titles from different systems (including HUMAN), given the abstract. Each instance (an abstract and its six titles) was evaluated by two to five annotators. The average percentage agreement over all annotator pairs is ∼50%, implying that, on average, two annotators agree on one of their two best/worst selections.
Then, we use best-worst scaling to obtain the final human score for each title. The final human score (BWS) is calculated as BWS = (N_best − N_worst) / N_annotators, where N_best/worst refers to the number of times the title was selected as one of the best/worst two titles and N_annotators indicates the number of annotators responsible for that instance. Therefore, BWS ranges from -1 to 1.
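A small helper implementing this score, with variable names mirroring the notation above:

```python
def bws(n_best: int, n_worst: int, n_annotators: int) -> float:
    """BWS = (N_best - N_worst) / N_annotators, bounded by [-1, 1]."""
    return (n_best - n_worst) / n_annotators

# Example: a title picked among the best two by 2 of 3 annotators and never
# among the worst two receives a score of 2/3.
print(bws(n_best=2, n_worst=0, n_annotators=3))
```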

Results
(a) Reference-based Evaluation We present the reference-based evaluation results in Table 5(a). HUMAN obtains the highest scores using all metrics, which is unsurprising, as the metrics use the human-generated titles as the anchor text. Among the six automatic generation systems, BART xsum is best, being selected by 3 out of 5 evaluation metrics, followed by BART cnn.
Table 5(b) shows the reference-free evaluation results (including human evaluation). In contrast to reference-based evaluation, only two evaluation metrics (COMET and MENLI) select HUMAN as the best generation system. BART xsum is still the best among the six automatic generation systems, obtaining the best results on 4 out of 6 evaluation metrics (including BWS). Surprisingly, it outperforms HUMAN even in the human evaluation (0.197 vs. 0.181 BWS). Nevertheless, as Figure 2(a) shows, HUMAN was still most frequently selected among the two best titles (23.2%) of all generation systems, whereas the best neural generation system, BART xsum, was selected in 16.9% of the cases as one of the best two titles. However, we observe that HUMAN was also more often selected among the two worst titles (14.1% vs. 9.3% for BART xsum; see Figure 2(b)), explaining why BART xsum is better than HUMAN in the human evaluation. We analyzed cases in which the human title was selected among the two worst. Introspection shows that this is mostly due to words in the title which do not appear in the abstract. As a consequence, human annotators may believe that the model is hallucinating. Overall, we thus believe that there is a (slight) mismatch in our task definition: human authors may leverage the whole paper when designing their titles, not only the abstract. However, paper2title generation would not only be a challenge for the text generation models (which are often limited in input length) but also for the human annotation process. We argue that framing the problem as abstract2title generation is a simplification with an overall good tradeoff between problem complexity and model and annotator capacity.

Correlation
To inspect the reliability of the used metrics, we calculate the Pearson correlation with system-level human judgements, i.e., the average BWS per system, based on the 1,380 titles (230 instances × 6 titles). From Table 6 (left block), we observe that: (1) most metrics perform better in the ref-based setup compared to the ref-free one, except for COMET; (2) only ref-free COMET correlates well with the human judgements (0.929 vs. −0.7 to 0.58), indicating that the majority of the examined metrics are not adequate for A2T generation.
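The aggregation itself is simple; the sketch below uses made-up scores for three hypothetical systems purely to illustrate the system-level computation.

```python
# System-level correlation: average the human BWS and the metric scores per
# system, then compute Pearson's r over the system means (toy values below).
import numpy as np
from scipy.stats import pearsonr

bws    = {"HUMAN": [0.30, 0.10, -0.20], "SysA": [0.20, 0.00, 0.10], "SysB": [-0.10, 0.00, 0.05]}
metric = {"HUMAN": [0.62, 0.58, 0.50],  "SysA": [0.60, 0.55, 0.53], "SysB": [0.49, 0.51, 0.52]}

systems = list(bws)
human_means  = [np.mean(bws[s]) for s in systems]
metric_means = [np.mean(metric[s]) for s in systems]
r, _ = pearsonr(human_means, metric_means)
print(f"system-level Pearson r = {r:.3f}")
```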

A2TMetric
Based on the above findings, we develop the first supervised A2T-specific evaluation metric, using the human judgments collected in the evaluation of the 230 instances. Since HUMAN is included in the evaluation as a generation system, and the metric will later be used to evaluate system-generated humorous titles, which may vastly differ from the original ones, we argue that a ref-free metric better suits our needs.
Dataset We split the data of 230 instances into train (170 instances), dev (25 instances), and test (35 instances) sets. We note that many titles receive a BWS of 0 when the number of annotators is small (because they were never selected among the best or worst two titles), which may be problematic for directly training a regression model. Besides, the human evaluation resembles a ranking process. Therefore, we convert the BWS in the train and dev sets to relative-ranking judgements (Ma et al., 2018): if two titles for one abstract obtain different BWS, this title pair is considered as one relative-ranking judgement. Each instance then contains one abstract, a "better" title, a "worse" title, and, in addition, the score difference between the two titles. As a consequence, our train set consists of 2,245 instances, whereas the dev set has 309 instances.
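The conversion can be sketched as follows; the dictionary layout of `instances` is an assumed representation of the annotated data, not our actual data format.

```python
# Convert per-title BWS scores into relative-ranking judgements: every pair of
# titles for the same abstract with different BWS yields one training pair.
from itertools import combinations

def to_ranking_pairs(instances):
    # each instance: {"abstract": str, "titles": [(title, bws_score), ...]}
    pairs = []
    for inst in instances:
        for (t1, s1), (t2, s2) in combinations(inst["titles"], 2):
            if s1 == s2:
                continue  # equal BWS -> no relative-ranking judgement
            better, worse = (t1, t2) if s1 > s2 else (t2, t1)
            pairs.append({"abstract": inst["abstract"],
                          "better": better, "worse": worse,
                          "bws_diff": abs(s1 - s2)})
    return pairs
```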
Framework We adopt a framework similar to the ranking-based variant of COMET to train the A2T metric, but in a ref-free setup. During training, the model optimizes the embedding space so that (1) the sentence embedding of the abstract (a) is closer to that of the "better" title (t+) than to that of the "worse" title (t−), using the Triplet Margin loss (Schroff et al., 2015), and (2) the difference between d(a, t+) and d(a, t−) is close to the difference in the BWS human scores for the two titles, using the MSE loss, where d(u, v) refers to the Euclidean distance between u and v. At prediction time, the metric calculates the Euclidean distance between the sentence embeddings of the abstract and the title.
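A schematic version of this objective is given below; the backbone encoder (roberta-base), the mean pooling, and the margin value are illustrative assumptions, not the exact configuration of A2TMetric.

```python
# Sketch of the ref-free A2TMetric objective: triplet margin loss on
# (abstract, better title, worse title) plus an MSE term tying the distance
# gap to the gap in human BWS scores.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-base")   # assumed backbone
enc = AutoModel.from_pretrained("roberta-base")

def emb(texts):
    # mean-pooled sentence embeddings; gradients flow through the encoder
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def a2t_loss(abstracts, better_titles, worse_titles, bws_diff, margin=1.0):
    a, tp, tn = emb(abstracts), emb(better_titles), emb(worse_titles)
    # (1) the abstract should be closer to the "better" than to the "worse" title
    triplet = F.triplet_margin_loss(a, tp, tn, margin=margin)
    # (2) the distance gap d(a, t-) - d(a, t+) should track the BWS difference
    gap = F.pairwise_distance(a, tn) - F.pairwise_distance(a, tp)
    mse = F.mse_loss(gap, torch.as_tensor(bws_diff, dtype=gap.dtype))
    return triplet + mse

def a2t_score(abstract, title):
    # at prediction time: (negative) Euclidean distance abstract <-> title,
    # so that higher values mean a better-fitting title
    with torch.no_grad():
        return -float(F.pairwise_distance(emb([abstract]), emb([title])))
```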

Humorous Title Generation
In the second phase of the experiments, we use the optimal model identified previously, i.e., BART xsum, to generate titles with constraints on the humor level. The input of the generation systems is formulated as "humor level [SEP] abstract", where humor level is either 0 (for ¬FUNNY) or 1 (for FUNNY).
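A minimal sketch of how such an input could be assembled; the checkpoint and the use of the tokenizer's separator token to realize "[SEP]" are assumptions for illustration.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")  # assumed checkpoint

def build_input(abstract: str, funny: bool) -> str:
    # "humor level [SEP] abstract", with 1 = FUNNY and 0 = not funny
    return f"{1 if funny else 0} {tok.sep_token} {abstract}"

print(build_input("We consider the end-to-end abstract-to-title generation problem ...", funny=True))
```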
Fine-tuning We fine-tune the generation systems here using the AdamW optimizer with a linear learning rate scheduler and obtain the titles with beam search as in Section 4.2: (1) we fine-tune BART xsum on the abstract-title pairs in the train set with humor constraints; (2) we continue fine-tuning the model from (1) on self-generated pseudo data.
The motivation for (2) is that, in initial experiments, we observe that the systems tend to ignore the humor constraints in the input and generate identical titles for different constraints. We assume that exposing the systems to titles with different humor levels for the same abstract during training can encourage the models to pay more attention to the humor constraints. To obtain the pseudo data, we perform the following steps: (i) We generate titles for abstracts in the train set but with "opposite" humor constraints compared to the original titles. For instance, if the original title for a certain abstract has humor label "0", we generate a pseudo title with constraint "1" for this abstract, utilizing the model obtained from (1).
(ii) We keep only the pseudo titles that are assigned the correct humor labels by the humor classifier.
(iii) We filter out the titles labeled as FUNNY that contain extremely frequent n-grams, in order to encourage the systems to generate more diverse titles. (iv) We finally merge the filtered pseudo data with the original data; a sketch of this pipeline is given below. Thus, in the training data of (2), each abstract has two titles, one labeled FUNNY and the other ¬FUNNY.
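The sketch below outlines this pipeline; generate_title() and humor_classifier() are placeholders for the fine-tuned generator from (1) and the humor classifier, and the n-gram order and frequency threshold are illustrative values.

```python
from collections import Counter

def word_ngrams(text, n=3):
    toks = text.lower().split()
    return list(zip(*(toks[i:] for i in range(n))))

def build_pseudo_data(train_set, generate_title, humor_classifier,
                      ngram_order=3, max_freq=20):
    pseudo = []
    for ex in train_set:                        # ex: {"abstract", "title", "label"}
        flipped = 1 - ex["label"]               # (i) opposite humor constraint
        title = generate_title(ex["abstract"], humor_level=flipped)
        if humor_classifier(title) == flipped:  # (ii) keep correctly labelled titles only
            pseudo.append({"abstract": ex["abstract"], "title": title, "label": flipped})

    # (iii) drop FUNNY pseudo titles dominated by extremely frequent n-grams
    counts = Counter(g for p in pseudo if p["label"] == 1
                     for g in word_ngrams(p["title"], ngram_order))

    def too_repetitive(p):
        grams = word_ngrams(p["title"], ngram_order)
        return p["label"] == 1 and grams and max(counts[g] for g in grams) > max_freq

    pseudo = [p for p in pseudo if not too_repetitive(p)]
    return list(train_set) + pseudo             # (iv) merge with the original data
```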
The training hyperparameters can be found in Appendix D. To monitor the models' ability to generate titles at the correct humor levels, we use the macro F1 between the expected humor labels (i.e., the humor constraints given in the inputs) and the humor labels assigned to the generated titles by the humor classifier as the performance indicator; based on this indicator on the dev set, we select the optimal model checkpoints of the two systems.

Automatic Evaluation
Based on the analysis of the automatic evaluation metrics in Section 4.4, we only use COMET and our supervised metric A2TMetric here to evaluate the titles' quality. To evaluate the systems' ability to generate titles at the correct humor levels, we use the following three metrics: (1) F1 macro between the expected humor labels and those assigned by the humor classifier; (2) the system accuracy of generating titles at the correct humor level, denoted as ACC FUNNY and ACC ¬FUNNY; (3) the ratio of cases in which a system generates the same title for both humor constraints to all generation cases (Ratio SAME), where a lower ratio indicates better performance; a sketch of these computations is given below.
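A sketch of how these control metrics can be computed; the record layout (two generated titles per abstract together with their classifier labels) is an assumed representation.

```python
from sklearn.metrics import f1_score

def control_metrics(records):
    # each record: {"title_0": str, "title_1": str, "pred_0": 0 or 1, "pred_1": 0 or 1}
    expected, predicted, same = [], [], 0
    for r in records:
        expected  += [0, 1]                       # constraints given to the system
        predicted += [r["pred_0"], r["pred_1"]]   # labels from the humor classifier
        same += int(r["title_0"] == r["title_1"])
    f1_macro   = f1_score(expected, predicted, average="macro")
    acc_nfunny = sum(p == 0 for e, p in zip(expected, predicted) if e == 0) / len(records)
    acc_funny  = sum(p == 1 for e, p in zip(expected, predicted) if e == 1) / len(records)
    ratio_same = same / len(records)
    return f1_macro, acc_nfunny, acc_funny, ratio_same
```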
We generate titles with constraints on both humor levels for all abstracts in the test set. Thus, we compute the automatic evaluation on 1,200 titles in total.
Results From Table 7, we observe that: (1) after continued training on the pseudo data, the system (BART xsum +pseudo) achieves substantially higher F1 macro (from 0.647 to 0.856) and ACC FUNNY (from 40.2% to 77.8%), and a slightly better Ratio SAME (from 6.5% to 4.7%), demonstrating the effectiveness of this training procedure. (2) However, ACC ¬FUNNY drops slightly compared to BART xsum (94.5% vs. 93.6%), indicating that both systems already perform well at generating ¬FUNNY titles and that the fine-tuning on pseudo data mainly improves the system's ability to generate FUNNY titles.
We then present the quality evaluation results in Table 8. Both BART-based systems can obtain better results than HUMAN, which is in line with the observation in Section 4.3.3, especially when generating ¬FUNNY titles, where they achieve higher scores than HUMAN on both evaluation metrics (0.0593 to 0.0598 vs. 0.0586 on COMET, and -2.3130 to -2.3014 vs. -2.3566 on A2TMetric). However, we observe a consistent performance drop after training on the pseudo data (values in the first row vs. those in the second row). Further, we also note that the system-generated ¬FUNNY titles have better quality than the FUNNY ones (values in the left column vs. those in the right column for each evaluation metric).

Human Evaluation
We randomly sample 50 abstracts from the test set with controls on the source of the papers (80% from NLP and 20% from ML) and on the humor label of the original titles (50% FUNNY and 50% ¬FUNNY), to generate the titles for human evaluation. Each of two annotators evaluates 25 instances separately; each instance contains one abstract and its five titles: 1 original title + 4 system titles (2 generation systems × 2 humor levels). During evaluation, the annotators rank the 5 titles by two criteria, general quality and humor degree, based on the abstract, while being unaware of the generation systems and humor constraints; soft ranking is allowed here, i.e., the annotators can assign identical ranks to multiple titles if they cannot differentiate them according to the evaluation criterion.

Results
In Table 9, we compare the two BART-based generation systems. Similar to the automatic evaluation, we observe (1) a general quality drop but a performance boost regarding humor generation after training on the pseudo data, and (2) that the ¬FUNNY titles have better quality compared to the FUNNY ones.

Table 9: Average rank of the system titles for all abstracts in the human evaluation of general quality and humor degree; smaller values denote higher ranks. "Humor constraint" refers to the constraints given to the input of the generation systems.
Further, we compare the system titles with the original human-generated titles in Table 10. HUMAN is ranked first across almost all evaluation criteria: human-generated titles are funnier (1.52 vs. 1.72 to 1.88) and have better quality (2.60 vs. 2.72 to 3.20 for FUNNY and 2.08 vs. 2.32 to 2.68 for ¬FUNNY titles) than the system-generated ones; the latter observation contradicts the conclusion from the automatic evaluation, where the system-generated ¬FUNNY titles obtain higher scores than the human-generated ones.

Table 10: Average rank of the system titles for the abstracts with original titles labeled as FUNNY and ¬FUNNY separately, in the human evaluation of general quality and humor degree; smaller values denote higher ranks. "Humor constraint/label" refers to the constraints given to the input of the generation systems and to the humor labels of the original titles.
On introspection, we find that the models often overfit to patterns seen in the training data when generating humorous titles, which may not fit the context of a specific paper, e.g., "Don't do X" where X is irrelevant to the current paper and/or directly taken from the training data.

We compare our fine-tuned BART xsum (without training on pseudo data) with the recently popular ChatGPT model. First, we use the two models to generate funny and not funny titles for 48 abstracts from the recent EMNLP 2022 handbook, which ChatGPT could not have seen in its training data. The prompt used for ChatGPT here is "I want a funny title and a not funny title for the following abstract: [abstract]"; the model then returns two titles with an indication of which is the funny and which the not funny title. The ranking-based human evaluation conducted here is identical to that described in Section 5.1.2. The two evaluators initially assess 10 instances, obtaining Kappa scores of 0.783 and 0.347 for humorousness and quality ranking, respectively, indicating potentially low agreement on the quality ranking. However, when we only consider the selected best and worst titles, the percentage agreement on the quality assessment rises to a satisfactory level (69.2%). We then use the average ranks over the two evaluators as the final ranks for the titles in the first 10 instances. Subsequently, each evaluator separately assesses another 19 instances.
The average rank per system and humor constraint is presented in Table 11. We observe that ChatGPT generates funnier but lower-quality titles compared to BART xsum. Nevertheless, in the quality ranking of ¬FUNNY titles, BART xsum is better than ChatGPT in 23 out of 43 cases and worse in 20 out of 43 cases (in 5 out of 48 cases the titles from the two systems rank equally). Hence, we conclude that ChatGPT without any fine-tuning may already perform similarly to our fine-tuned BART xsum, acknowledging, however, the small scale of our human annotation. The evaluation results suggest that system-generated funny titles tend to rank lower in the quality evaluation compared to the not funny ones. Examples of system-generated titles with high humor ranks but low quality ranks are shown in Appendix E. As before, we note that those titles frequently incorporate phrases unrelated to the content and demonstrate deficiencies in terms of information correctness and coverage.

Conclusion
We considered the abstract-to-title generation problem using end-to-end models. To do so, we trained six recent text-to-text generation systems on more than 30k NLP and ML papers. We evaluated the systems using an array of state-of-the-art automatic metrics as well as human evaluation. Our evaluation indicates that some current text generation models can generate titles of similar quality to human-written ones, but human authors are apparently still superior. We also considered the humorous title generation problem as an extension, compiling the first dataset in the NLP/ML domain in this context, comprising almost 2.5k titles annotated by two annotators with acceptable agreement. We find that our systems struggle to generate humorous titles and instead overfit to frequent patterns in the data, indicating much scope for future research.

A Examples of funny titles
Table 12 shows 20 titles labeled as FUNNY or FUNNY medium by human annotators.

B Humor annotation + classification
The two annotators first annotated the same 230 titles independently, obtaining only 0.397 Kappa agreement, which indicates relatively poor annotation quality. To improve the inter-annotator agreement, they then discussed the reasons for their disagreements. Subsequently, they annotated another 300 titles independently, achieving a decent 0.650 Kappa for a task as subjective as humor. For these 300 titles, we use the maximal label value among the two annotations as the final label, i.e., if one annotator labels a title with 1 (FUNNY medium) while the other labels it with 0 (¬FUNNY), we assign label 1 to the title. Each annotator then labeled 600 different titles separately, bringing the total to 1,730 (230 + 300 + 600 × 2 = 1,730) annotated titles, of which 1,603 were labeled as ¬FUNNY, 106 as FUNNY medium and 21 as FUNNY.
As the funny titles are very few compared to the not funny ones, we generate 11 different data splits, where the train set of each split consists of 100 funny titles and 200 not funny ones (randomly sampled from the 1,730 titles), while the remaining 27 funny titles and another 27 not funny ones compose the dev set. From the 11 different data splits, we obtain 11 classifiers (checkpoints selected based on the macro F1 on each dev set). We then evaluate ensembles of the 11 classifiers on 315 newly annotated titles by the two annotators, who obtain 0.639 Kappa agreement this time. With this step, we identify the optimal ensemble of the classifiers and also obtain more funny titles from the whole dataset by annotating the funniest titles selected by the ensemble classifiers. We design two types of ensemble classifiers:

• EnsMV, which relies on the majority vote of the 11 classifiers. Specifically, each title receives 11 labels from the 11 classifiers: if the number of ¬FUNNY labels exceeds 5, the title is labeled as ¬FUNNY; if not, the title is labeled as FUNNY when the number of FUNNY labels exceeds the number of FUNNY medium labels, and as FUNNY medium otherwise.
• EnsSUM i,j, which depends on the sum of the label values. The sum of the label values for each title ranges from 0 (11 classifiers × 0 for ¬FUNNY) to 22 (11 classifiers × 2 for FUNNY). We then select a threshold i for FUNNY medium and a threshold j for FUNNY: if sum < i, the title is labeled as ¬FUNNY; otherwise, it is labeled as FUNNY medium (when sum < j) or FUNNY (when sum ≥ j).
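Both strategies reduce to a few lines; the sketch below mirrors the rules above, instantiated with the thresholds i=7, j=16 of the reported EnsSUM 7,16.

```python
from collections import Counter

def ens_mv(labels):                  # labels: the 11 classifier outputs in {0, 1, 2}
    counts = Counter(labels)
    if counts[0] > 5:                # more than 5 of the 11 votes say "not funny"
        return 0
    return 2 if counts[2] > counts[1] else 1

def ens_sum(labels, i=7, j=16):      # thresholds of EnsSUM 7,16
    s = sum(labels)                  # ranges from 0 (all 0) to 22 (all 2)
    if s < i:
        return 0                     # not funny
    return 1 if s < j else 2         # medium funny if s < j, else funny
```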
           Individuals   EnsMV   EnsSUM 7,16
Macro F1   57.6%         61.4%   62.4%

Table 13: Average macro F1 over the 11 individual classifiers and macro F1 of the ensemble classifiers from Stage 1 on the evaluation data of 315 titles (where the two annotators obtain 0.639 Kappa). We bold the highest macro F1 score.
Table 13 shows the evaluation results of Stage 1; we only present the performance of EnsSUM i,j with the optimal i and j here, i.e., EnsSUM 7,16. We observe that: (1) both ensembles perform better than the individual classifiers (+4-5% macro F1) and (2) EnsSUM 7,16 performs slightly better than EnsMV (62.4% vs. 61.4% macro F1); we therefore adopt the sum-based strategy as the optimal ensemble strategy.

Table 14: Kappa agreements between the two annotators on several subsets of the data. "#titles" refers to the number of titles in a certain subset. We bold the higher Kappa on the same data.

D Parameters for humor generation
We train BART xsum on our train set using the AdamW optimizer with weight decay 0.01 and learning rate 4e-05 for 5 epochs. We then continue to train it on the pseudo data for one epoch to obtain BART xsum +pseudo. We use the default settings of Huggingface's Trainer API for the other hyperparameters.
E Examples of low-quality system-generated funny titles

Table 17 shows 10 system-generated low-quality funny titles, of which 5 are from ChatGPT and 5 from BART xsum. They obtain good results in the humor ranking but bad results in the quality ranking according to our human evaluation.
Figure 1 :
Figure 1: Histogram of abstract lengths. We define the length of a text as its number of whitespace-separated tokens.

Figure 2 :
Figure 2: Distribution of the generation systems among the titles selected as the BEST/WORST ones in the human evaluation; percentages indicate the proportion of each generation system over all selections.

Table 1 :
Examples of annotated titles.

Table 2 :
Example of an instance in our dataset (columns: Abstract, Title, Label). Each instance contains the abstract, the title, and the label for the title's humor degree. This instance contains the data for the paper by Andrychowicz et al. (2016), whose abstract begins "The move from hand-designed features to learned features in machine learning has been wildly successful. [...] We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art."

Table 3 :
Average macro F1 over the 11 individual classifiers and macro F1 of the ensemble classifiers from both stages on the held-out test set (where the two annotators obtain 0.649 Kappa agreement). We bold the highest macro F1 for both classification tasks.

Table 4 :
Distribution of the source (NLP or ML) and humor labels (FUNNY or ¬FUNNY) of the instances in our dataset.

Table 5 :
Evaluation results of the baseline models. We underline the best performance among all generation systems including HUMAN. We bold the best performance among all automatic generation systems excluding HUMAN.

Table 7 :
Automatic evaluation of the systems' ability to generate titles with correct humor constraints (columns: F1 macro, ACC ¬FUNNY, ACC FUNNY, Ratio SAME). We bold the best performance.

Table 8 :
Automatic evaluation of the titles' quality. We bold the best performance assessed by each metric. "Humor constraint" refers to the constraints given to the input of the generation systems.

Table 11 :
Average ranks of the titles for the 48 abstracts from EMNLP 2022 handbook in the human evaluation of quality and humorousness; smaller values denote higher ranks.We bold the highest ranks.

Table 12 :
Examples of funny titles in the annotated data.
Table 15 displays the hyperparameters used in Section 4.2 for training the six models, and Table 16 shows the parameters used for beam search.

Table 15 :
Training hyperparameters for title generation (columns: learning rate, batch size, epochs, gradient accumulation steps). We use the AdamW optimizer with a weight decay of 0.01 and keep the other settings as default in Huggingface's Trainer API.

Table 16 :
Parameter settings for beam search.

BART xsum:
Invite Adversaries to Poison Your Data: Exploiting Federated Learning for Adversarial Backdoor Attacks
Don't Take the Easy Way Out: Generating Adversarial Negative Responses with Large-Scale Language Models for Dialogue Selection
Don't Give Up on Style: Learn to Generate Stylistically-Diverse Summaries with Multiple Decoders
CKD: Curriculum Knowledge Distiller for Cross-Lingual Sentiment Analysis with Emoji
Successive Prompting: Learning to Break Down Complex Questions into As Simple As Possible

ChatGPT:
Graphin' It Up: A Humorous Guide to Generative Knowledge Construction
Tiny Tasks, Big Results: A Hilarious Guide to Few-Shot Relation Extraction
Revealing the Magic Behind Transformer Language Models: A Lighthearted Investigation
Ask and You Shall Receive: A Whimsical Approach to Automatic Question Generation
Federated Learning: The More You Poison, the More You Win!

Table 17 :
Examples of system-generated low-quality funny titles, which obtain high humor ranks but low quality ranks in the human evaluation.