SemEval-2020 Task 7: Assessing Humor in Edited News Headlines

This paper describes the SemEval-2020 shared task “Assessing Humor in Edited News Headlines.” The task’s dataset contains news headlines in which short edits were applied to make them funny, and the funniness of these edited headlines was rated using crowdsourcing. This task includes two subtasks, the first of which is to estimate the funniness of headlines on a humor scale in the interval 0-3. The second subtask is to predict, for a pair of edited versions of the same original headline, which is the funnier version. To date, this task is the most popular shared computational humor task, attracting 48 teams for the first subtask and 31 teams for the second.


Introduction
Humor is an important ingredient of human communication, and every automatic system aiming at emulating human intelligence will eventually have to develop capabilities to recognize and generate humorous content. In the artificial intelligence community, research on humor has been progressing slowly but steadily. As an effort to boost research and spur new ideas in this challenging area, we created a competitive task for automatically assessing humor in edited news headlines.
Like other AI tasks, automatic humor recognition depends on labeled data. Nearly all existing humor datasets are annotated to study the binary task of whether a piece of text is funny (Mihalcea and Strapparava, 2005; Kiddon and Brun, 2011; Bertero and Fung, 2016; Raz, 2012; Filatova, 2012; Zhang and Liu, 2014; Reyes et al., 2012; Barbieri and Saggion, 2014). Such categorical data does not capture the non-binary character of humor, which makes it difficult to develop models that can predict a level of funniness.
Humor occurs in various intensities, and certain jokes are much funnier than others, including the supposedly funniest joke in the world (Wiseman, 2011). A system's ability to assess the degree of humor makes it useful in various applications, such as in humor generation where such a system can be used in a generate-and-test scheme to generate many potentially humorous texts and rank them by funniness, for example, to automatically fill in the blanks in Mad Libs® for humorous effects (Hossain et al., 2017; Garimella et al., 2020).
For our SemEval task, we provided a dataset that contains news headlines with short edits applied to them to make them humorous (see Table 1). This dataset was annotated as described in Hossain et al. (2019) using Amazon Mechanical Turk, where qualified human workers edited headlines to make them funny and the quality of humor in these headlines was assessed by a separate set of qualified human judges on a 0-3 funniness scale (see Figure 1). This method of quantifying humor enables the development of systems for automatically estimating the degree of humor in text.
Table 1: Edited headlines from our dataset and their funniness rating. We report the mean of the estimated ratings from the top 20 ranked participating systems (Est.) and its difference from the true rating (Err.).
Our task comprises two subtasks:
• Subtask 1: Estimate the funniness of an edited headline on a 0-3 humor scale.
• Subtask 2: Given two edited versions of the same headline, determine which one is funnier.
Inviting multiple participants to a shared task contrasts with most current work on computational humor, which consists of standalone projects, each exploring a different genre or type of humor. Such projects typically involve gathering new humor data and applying machine learning to solve a particular problem. Repeated attempts at the same problem are rare, hindering incremental progress, which emphasizes the need for unified, shared humor tasks.
Recently, competitive humor tasks including shared data have been posed to the research community. One example is #HashtagWars (Potash et al., 2017), a SemEval task from 2017 that attracted eight distinct teams, where the focus was on ranking the funniness of tweets from a television show. The HAHA competition (Chiruzzo et al., 2019) had 18 participants who detected and rated humor in Spanish language tweets. There were 10 entries in a SemEval task from 2017 that looked at the automatic detection, location, and interpretation of puns (Miller et al., 2017). Finally, a related SemEval 2018 task involved irony detection in tweets (Van Hee et al., 2018).
Ours is the largest shared humor task to date in terms of participation. More than 300 participants signed up, 86 teams participated in the development phase, and 48 and 31 teams participated, respectively, in the two subtasks in the evaluation phase. By creating an intense focus on the same humor task from so many points of view, we were able to clearly understand how well these systems work as a function of different dimensions of humor, including which type of humor appears easiest to rate automatically.

Datasets
The data for this task is the Humicroedit dataset described in our previous work (Hossain et al., 2019). This dataset contains about 5,000 original headlines, each having three modified, potentially funny versions, for a total of 15,095 edited headlines. The original headlines were collected from Reddit (reddit.com) via the popular subreddits r/worldnews and r/politics, where headlines from professional news sources are posted every day. These headlines were published between 01/2017 and 05/2018, are 4 to 20 words long, and are sampled from headlines written by 25 major English news sources.
The data was annotated using workers from Amazon Mechanical Turk, who were screened using a qualification phase to find expert headline editors and judges of humor. The editors were instructed to make a headline as funny as possible to a generic wide audience by applying a micro-edit, which is a replacement of a verb/noun/entity in the headline with a single word. Examples are shown in Table 1. By allowing only small edits, researchers can examine humor at the atomic level where the constrained degrees of freedom are likely to simplify analysis, understanding, and eventually generation.
Five judges were asked to rate the funniness of each edited headline using the following humor scale:
• 0: Not funny
• 1: Slightly funny
• 2: Moderately funny
• 3: Funny
The funniness of an edited headline is the mean of the ratings from its five judges. For further details and analysis of the dataset, we refer the reader to Hossain et al. (2019).
For our task, we randomly sampled the Humicroedit dataset into train (64%), dev (16%) and test (20%) sets such that all edited versions of an original headline reside in exactly one of these sets, as opposed to the sampling in Hossain et al. (2019) which allowed overlap of original versions of headlines among its dataset partitions for a slightly different humorous headline classification task.
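The two dataset mechanics described here, averaging five ratings and splitting so that all edits of one original headline stay together, can be illustrated with a minimal sketch. The field name `original_id`, the seed, and the choice to apply the 64/16/20 fractions to original headlines (rather than to edited headlines) are assumptions made for illustration; this is not the script used to build the official split.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_funniness(ratings):
    """Funniness of an edited headline: the mean of its judges' 0-3 ratings."""
    return mean(ratings)


def grouped_split(rows, fractions=(0.64, 0.16, 0.20), seed=0):
    """Split rows into train/dev/test so that every edited version of an
    original headline lands in exactly one partition.

    rows: list of dicts, each with a hypothetical "original_id" key.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["original_id"]].append(row)

    # Shuffle the original-headline ids, not the individual edited headlines.
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)

    n_train = round(fractions[0] * len(ids))
    n_dev = round(fractions[1] * len(ids))
    parts = {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }
    return {name: [r for i in id_list for r in groups[i]]
            for name, id_list in parts.items()}
```

Splitting on the original headline rather than on individual rows is what prevents near-duplicate contexts from leaking between the training and test sets.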
We also provided additional training data from FunLines (Hossain et al., 2020), a competition that we hosted to collect humorous headlines at a very low cost. The data collection approaches for Humicroedit and FunLines are mostly similar, but FunLines additionally includes headlines from the news categories sports, entertainment and technology, and its headlines were published between 05/2019 and 01/2020, for a total of 8,248 annotated headlines. More than 40% of the participating teams, including the winning team, made use of the FunLines data.

Task Description
The objective of this shared task is to build systems for rating a humorous effect that is caused by small changes in text. To this end, we focus on humor obtained by applying micro-edits to news headlines.
Editing headlines presents a unique opportunity for humor research since headlines convey substantial information using only a few words. This creates a rich background against which a micro-edit can lead to a humorous effect. With that data, a computational humor model can focus on the exact localized cause of the humorous effect in a short textual context. We split our task into two subtasks. The dataset statistics for these subtasks are shown in Table 2.

Subtask 1: Funniness Regression
In this task, given the original and the edited versions of a headline, the participant has to estimate the mean funniness of the edited headline on the 0-3 humor scale. Systems tackling this task can be useful in a humor generation scenario where generated candidates are ranked according to expected funniness.

Subtask 2: Funnier of the Two
In this task, given the original headline and two of its edited versions, the participating system has to predict which edited version is the funnier of the two. Consequently, by looking at gaps between the funniness ratings, we can begin to understand the minimal discernible difference between funny headlines.

Metrics
For Subtask 1, systems are ranked using the root mean squared error (RMSE) between the mean of the five annotators' funniness ratings and the rating estimated by the system for the headlines. Given $N$ test samples, and given the ground truth funniness $y_i$ and the predicted funniness $\hat{y}_i$ for the $i$-th sample:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
For Subtask 2, which attempts to find the funnier of the two modified versions of a headline, the evaluation metric is classification accuracy. We also report another auxiliary metric called the reward. Given $N$ test samples with $C$ correct predictions, and given, for the $i$-th sample, the funniness ratings of its two edited headlines $f_i^{(1)}$ and $f_i^{(2)}$, its ground truth label $y_i$ and its predicted label $\hat{y}_i$:
$$\mathrm{Accuracy} = \frac{C}{N}, \qquad \mathrm{Reward} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbb{1}_{\hat{y}_i = y_i} - \mathbb{1}_{\hat{y}_i \neq y_i}\right)\left|f_i^{(1)} - f_i^{(2)}\right|$$
In other words, for a larger funniness difference between the two edited headlines in a pair, the reward (or penalty) is higher for a correct classification (or misclassification). We ignore cases where the two edited versions of a headline have the same ground truth funniness.
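As an illustration of these metrics, the following sketch computes RMSE, accuracy and reward. It assumes that the reward is the mean, over non-tied pairs, of the absolute funniness gap, counted positively for a correct prediction and negatively for a misclassification, which is how the prose above describes it. The function names and the label convention (1 or 2 for the funnier headline) are illustrative, and this is not the official scoring script.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and estimated funniness."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def accuracy_and_reward(f1, f2, y_pred):
    """Accuracy and reward for the pairwise subtask.

    f1, f2: ground truth funniness of edited headlines 1 and 2 in each pair.
    y_pred: predicted label per pair (1 or 2, the headline judged funnier).
    Pairs whose two headlines have equal ground truth funniness are ignored.
    """
    n = 0
    correct = 0
    reward = 0.0
    for a, b, pred in zip(f1, f2, y_pred):
        if a == b:
            continue  # tied pairs are excluded from evaluation
        label = 1 if a > b else 2
        gap = abs(a - b)
        n += 1
        if pred == label:
            correct += 1
            reward += gap  # larger gaps earn more when classified correctly
        else:
            reward -= gap  # and cost more when misclassified
    if n == 0:
        raise ValueError("no non-tied pairs to evaluate")
    return correct / n, reward / n
```

A system with high accuracy but low reward is mostly winning on pairs with small funniness gaps, which is why the two numbers are worth reading together.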

Benchmarks
Table 3: Benchmarks on the test set. The best within each model type is bolded, and the overall best is underlined.
We provide several benchmarks in Table 3 to compare against participating systems:
1. BASELINE: assigns the mean rating (Subtask 1) or the majority label (Subtask 2) from the training set.
2. CBOW: the context-independent word representations obtained using the pretrained GloVe word vectors with 300d embeddings and a dictionary of 2.2M words.
3. BERT: a regressor based on BERT base model embeddings (Devlin et al., 2019).
4. RoBERTa: same regressor as above but uses RoBERTa embeddings (Liu et al., 2019).
For a thorough discussion of these benchmarks, we refer the reader to the description of the Duluth system, whose authors performed these ablation experiments.
In summary, each benchmark result uses the edited headline, CONTEXT implies using the headline's context (with the replaced word substituted with [MASK]), ORIG implies using the original headline, FT refers to finetuning, FREEZE implies feature extraction (no finetuning) and FUNLINES refers to using the FunLines training data. The results for Subtask 2 were obtained by using the model trained for Subtask 1 to assign funniness ratings to both the edited versions of a headline and then choosing the version scoring higher.
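The BASELINE benchmark and the reuse of a Subtask 1 regressor for Subtask 2 are simple enough to sketch directly. In the code below, `score_fn` stands in for any trained funniness regressor and is hypothetical, as is the tie-breaking rule (headline 1 wins ties), which the text does not specify.

```python
from collections import Counter
from statistics import mean


def baseline_rating(train_ratings):
    """Subtask 1 BASELINE: always predict the mean training rating."""
    return mean(train_ratings)


def baseline_label(train_labels):
    """Subtask 2 BASELINE: always predict the majority training label."""
    return Counter(train_labels).most_common(1)[0][0]


def funnier_of_two(score_fn, edited_1, edited_2):
    """Subtask 2 via a Subtask 1 model: score both edited headlines and
    return the label (1 or 2) of the higher-scoring one.

    Ties go to headline 1 here; this choice is an assumption.
    """
    return 1 if score_fn(edited_1) >= score_fn(edited_2) else 2
```

For example, `funnier_of_two(len, "short", "a longer headline")` returns 2 when the toy scorer is simply the string length.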

Results
The official results for Subtasks 1 and 2 are shown, respectively, in Tables 4 and 5, including the performance of the benchmarks. There were 48 participants for Subtask 1, while Subtask 2 attracted 31 participants. For both subtasks, the best performing system was Hitachi, achieving an RMSE of 0.49725 (a 13.5% improvement over BASELINE) for Subtask 1, and an accuracy of 67.43% (a 17.93 increase in percentage points over BASELINE) for Subtask 2.

Overview of Participating Systems
The dominant teams made use of pre-trained language models (PLM), namely BERT, RoBERTa, ELMo (Peters et al., 2018), GPT-2 (Radford et al., 2019) and XLNet (Yang et al., 2019). Context-independent word embeddings, such as Word2Vec (Mikolov et al., 2013), FastText (Joulin et al., 2017) and GloVe word vectors (Pennington et al., 2014), were also useful. The winning teams combined the predictions of several hyperparameter-tuned versions of these models using regression in an ensemble learner to arrive at the final prediction. Next, we summarize the top systems and other notable approaches.
First, we note that for Subtask 2, most systems relied on the model they developed for Subtask 1. This involved using the model to estimate a real number funniness rating for each of the two edited headlines, and selecting the one which achieved the higher estimated rating. As a result, there was a strong correlation between teams' placements in Subtask 1 and Subtask 2, with the top 3 teams in both subtasks being the same.

The Hitachi System
The winner of both tasks, Hitachi (Morishita et al., 2020), formulated the problem as sentence pair regression and exploited an ensemble of the PLMs BERT, GPT-2, RoBERTa, XLNet, Transformer-XL and XLM. Their training data uses the pairs of headlines, with the replacement word marked with special tokens, and they fine-tune 50 instances per PLM, each having a unique hyperparameter setting. After applying 5-fold cross validation, they selected the 20 best performing settings per PLM, for a total of 700 PLMs (7 PLMs × 20 hyperparameters × 5 folds). They combined the predictions of these models via Ridge regression in the ensemble to predict final funniness scores. Hitachi uses the additional training data from FunLines.
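To make the idea of combining many model predictions with Ridge regression concrete, here is a small self-contained sketch of a stacked ensemble. It is not Hitachi's implementation: the closed-form solver, the regularization strength `alpha`, and the decision to leave the intercept unpenalized are all assumptions, and in practice the combiner would normally be fit on out-of-fold predictions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting (a is a square list of lists)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(preds, targets, alpha=1.0):
    """Fit ridge weights over base-model predictions.

    preds: one row per headline, one column per base model.
    Returns (weights, intercept); the intercept is not penalized.
    """
    n, k = len(preds), len(preds[0])
    col_means = [sum(row[j] for row in preds) / n for j in range(k)]
    y_mean = sum(targets) / n
    xc = [[row[j] - col_means[j] for j in range(k)] for row in preds]
    yc = [t - y_mean for t in targets]
    # Normal equations with an L2 penalty: (X^T X + alpha I) w = X^T y
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(k)]
    weights = solve(gram, rhs)
    intercept = y_mean - sum(w * mu for w, mu in zip(weights, col_means))
    return weights, intercept


def predict(weights, intercept, row):
    """Ensemble funniness estimate for one headline's base predictions."""
    return intercept + sum(w * x for w, x in zip(weights, row))
```

The L2 penalty matters when hundreds of base models make highly correlated predictions, since unregularized weights would then be poorly determined.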

The Amobee System
Amobee (Rozental et al., 2020) was the 2nd placed team for both Subtasks. Using PLM token embeddings, they trained 30 instances of BERT, RoBERTa and XLNet, combining them for an ensemble of 90 models.

The YNU-HPCC System
Unlike the top two systems, the 3rd placed YNU-HPCC (Tomasulo et al., 2020) employed an ensemble method that uses only the edited headlines. They used multiple pre-processing methods (e.g., cased vs uncased, with or without punctuation), and they encoded the edited headlines using FastText, Word2Vec, ELMo and BERT encoders. The final ensemble consists of 11 different encodings (four FastText, two Word2Vec, four BERT, one ELMo). For each of these encodings, a bidirectional GRU was trained using the encoded vectors. In the ensemble, the GRU predictions were concatenated and fed to an XGBoost regressor.

MLEngineer
The MLEngineer (Shatnawi et al., 2020) team also used only the edited headlines. They fine-tune and combine four BERT sentence regression models to estimate a rating, and they combine it with the estimated rating from a model that incorporates RoBERTa embeddings and a Naïve Bayes regressor to generate the final rating.

The LMML and ECNU Systems
These systems (Ballapuram, 2020) estimate the funniness of headlines using a neural architecture that focuses on the importance of the replaced and replacement words against the contextual words in the headline. They use BERT embeddings and compute feature vectors based on the global attention between the contextual words and the replaced (and replacement) word. These two vectors and the vectors of the replaced and replacement words are combined, and the resulting vector is passed through a multi-layer perceptron to estimate the headline's funniness.

Other Notable Approaches
ECNU used sentiment and humor lexicons, respectively, to extract polarities and humor rating features of headlines. They also used the average, minimum and maximum humor ratings of replaced/replacement words from the training set as additional features. LT3 (Vanroy et al., 2020) created an entirely feature-engineered baseline which obtained an RMSE of 0.572. It uses lexical, entity, readability, length, positional, word embedding similarity, perplexity and string similarity features.
IRLab DAIICT trained five BERT classifiers, one for each of the five ratings for a headline, and calculated the mean of the five classifiers' outputs. This mean was further averaged with the output of a BERT regression model which predicts the overall mean rating.
Hasyarasa (Desetty et al., 2020) used a word embedding and knowledge graph based approach to build a contextual neighborhood of words to exploit entity interrelationships and to capture contextual absurdity. Features from this and semantic distance based features are finally combined with headline representations from a Bi-LSTM.
UTFPR (Paetzold, 2020) is a minimalist unsupervised approach that uses word co-occurrence features derived from news and EU parliament transcripts to capture unexpectedness.
Some noteworthy pre-processing techniques included non-word symbol removal, word segmentation, and manual removal of common text extensions in headlines (e.g., "-live updates"). Finally, notable datasets used were the iWeb corpus and a news headline corpus.

General Trends
Here we discuss the relative merits of the different systems, with respect to the participants' findings. Table 3 suggests that contextual information is useful in our humor recognition tasks, since the context-independent GloVe embeddings (CBOW) led to weaker performance compared to using the context-sensitive BERT and RoBERTa embeddings.
According to ablation experiments by Hitachi (Morishita et al., 2020), the individual PLMs rank from best to worst performing as follows: RoBERTa, GPT-2, BERT, XLM, XLNet and Transformer-XL. Analysis performed by several task participants indicates that the neural embeddings were unable to recognize humor where a rich set of common sense and/or background knowledge is required, for example, in the case of irony.
Lastly, a few systems had quite low accuracy for Subtask 2. They reported having bugs that caused them to submit a random baseline, which has about a 33% chance of success (since the possible predictions were "headline 1 is funnier", "headline 2 is funnier" and "both headlines have equal funniness").

Analysis and Discussion
The outputs of 48 participating systems for Subtask 1 and 31 for Subtask 2 present an opportunity to not only study individual solutions and numeric results, but to also take a deeper qualitative look at the output of these systems. Here, we collectively analyze the performance of the top 20 systems per subtask to find aggregate trends that characterize the general approaches and the challenges of assessing humor itself. To better understand which funniness ranges are particularly hard for systems to assess, we study the performance of the systems as a function of ground truth funniness. As shown in Figure 2, we grouped the edited headlines into funniness bins of width 0.2. For each bin, we plotted the mean absolute regression errors for the top 20 systems aggregated (max RMSE = 0.547), the winning Hitachi system (RMSE = 0.497), the 19 other systems and BASELINE (RMSE = 0.575).
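The binning analysis can be sketched as follows: group headlines by ground truth funniness into bins of width 0.2 and average the absolute error within each bin. The function name and the small epsilon used to keep values such as 0.6 from falling into the wrong bin through floating-point rounding are choices made for this illustration.

```python
from collections import defaultdict


def error_by_funniness_bin(y_true, y_pred, width=0.2):
    """Mean absolute error per ground truth funniness bin.

    Returns a dict mapping each bin's lower edge to the mean absolute
    error of the headlines whose true funniness falls in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        # Mean ratings from five judges are multiples of 0.2, so they sit
        # exactly on bin edges; the epsilon guards against float rounding.
        idx = int(t / width + 1e-9)
        sums[idx] += abs(t - p)
        counts[idx] += 1
    return {round(idx * width, 10): sums[idx] / counts[idx]
            for idx in sorted(counts)}
```

Applying the same function to each system's predictions, and to a constant mean-rating predictor, gives curves that can be compared bin by bin.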

Subtask 1 (Regression)
In general, all these systems have their minimum error at a funniness score of about 1.0. While the Hitachi system stands out somewhat in its superior performance at the two extremes of the funniness scale, the other systems follow generally the same pattern, and none appear to be outliers. Assessing more extreme humor (or lack thereof) appears to be harder since all the systems have larger errors toward the extremes of the funniness scale. This may also be due to the non-uniform distribution of ground truth funniness scores in the dataset (shown as the blue curve), with the extreme values being less frequent.

Antipodal RMSEs
Figure 3: Overall and antipodal RMSE of the ranked participating systems and BASELINE for Subtask 1.
Figure 3 shows the systems' antipodal RMSE, an auxiliary metric for Subtask 1, which we calculated by applying the RMSE metric to only the X% most funny and the X% least funny headlines, for X ∈ {10, 20, 30, 40}. The systems are ranked by their overall RMSE for Subtask 1. It appears that some of the systems further down the ranking are doing much better at estimating the funniness of the extremes in the dataset than systems ranked above them. For example, the large dip shows that the system ranked 41 (Hahackathon) is performing better at estimating the funniness of the top 10-40% most/least funny headlines than several systems ranked before it. This suggests that combining these approaches can yield better results, for example, using some selected systems to rank certain subsets of headlines.
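A minimal sketch of the antipodal RMSE computation follows. How the X% cut-off is rounded and how ties in ground truth funniness are ordered are not specified in the text, so the rounding and the stable sort used here are assumptions.

```python
import math


def antipodal_rmse(y_true, y_pred, x_percent):
    """RMSE restricted to the X% least funny and X% most funny headlines,
    selected by ground truth funniness."""
    order = sorted(range(len(y_true)), key=lambda i: y_true[i])
    k = max(1, round(len(order) * x_percent / 100))
    selected = order[:k] + order[-k:]  # both extremes of the scale
    squared = sum((y_true[i] - y_pred[i]) ** 2 for i in selected)
    return math.sqrt(squared / len(selected))
```

With X up to 40, the two slices do not overlap, so no headline is counted twice.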

Systematic Estimation Errors
We now analyze headlines for which the ratings from the top 20 systems were all either underestimates or overestimates. Table 1 shows examples of these headlines, their ground truth funniness rating, the mean of the estimated ratings of the top 20 systems and its difference from the ground truth.
Lack of understanding of world knowledge (Headline R1), cultural references (R2) and sarcasm (R3, R4 and R5) are clearly hurting these systems. The models are having difficulty recognizing the effects of negative sentiments on humor (R7 and R8) and the complex boundaries between negative sentiment and sarcastic humor (R4 and R8 both discuss death but R4 does it in a funny way). A better understanding of common sense could have helped resolve these subtleties. R3 also has the humorous effect brought about by a tension relief, which is a complex phenomenon to model. Finally, the systems are not expected to infer that bathroom humor (R6) was purposely annotated as "not funny" in the data (Hossain et al., 2019).

Subtask 2 (Classification)
Here we examine the top 20 aggregate system performances on Subtask 2. These 20 systems have at least 59.7% classification accuracy, much higher than the 49.5% accuracy of BASELINE.
First, we analyze the difficulty of the classification by calculating the percentage of headline pairs correctly classified by exactly N systems, for 0 ≤ N ≤ 20, as shown in the blue curve in Figure 4(a). As an example, there is a subset of about 3% of the headline pairs that were correctly classified by 10 of the top 20 systems. The curve rises rapidly to the right, indicating that a large fraction of the pairs can be correctly classified by 16 or more systems.
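The difficulty curve described here can be computed from a table of per-system correctness. In this sketch, `correct_matrix` is a hypothetical input with one row per system and one boolean per headline pair.

```python
from collections import Counter


def percent_by_num_correct(correct_matrix):
    """Percentage of headline pairs classified correctly by exactly N
    systems, for every N from 0 to the number of systems.

    correct_matrix[s][i] is True if system s got pair i right.
    """
    n_systems = len(correct_matrix)
    n_pairs = len(correct_matrix[0])
    # zip(*...) walks the matrix pair by pair instead of system by system.
    counts = Counter(sum(column) for column in zip(*correct_matrix))
    return {n: 100.0 * counts.get(n, 0) / n_pairs
            for n in range(n_systems + 1)}
```

The percentages sum to 100, so the result can be read directly as a distribution over pair difficulty.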

Incongruity at Play
We investigate to what extent the participating systems model incongruity as a cause of humor, as postulated in the incongruity theory of humor (Morreall, 2016). This theory claims that jokes set up an expectation that they violate later, triggering surprise and thereby generating humor. We test this hypothesis by examining the cosine distances between the GloVe vectors of the original word and each replacement word. We assume that the larger this distance is, the higher is the expected incongruity.
The dashed curve in Figure 4(a) shows the incongruity measure obtained using GloVe word distances:
$$\text{incongruity difference} = \mathrm{distance}(\text{orig}, \text{edit}_2) - \mathrm{distance}(\text{orig}, \text{edit}_1)$$
$$\text{incongruity measure} = \mathrm{correlation}(\text{incongruity difference}, \text{ground truth label} \in \{1, 2\})$$
This rising curve implies that the funnier headline in a pair is recognized by more systems if its replacement word is more distant from the original word compared to the distance between the original word and the less funny headline's replacement word. This indicates that these systems are possibly detecting which headline in the pair is more incongruous compared to the original headline. Moreover, for the headline pairs which were incorrectly classified by all systems, the incongruity measure is around -0.6, implying that in these headline pairs, the less incongruous (i.e., more coherent) version is the funnier of the two. This further indicates that these systems are mostly recognizing incongruity and they tend to fail where incongruity is not the cause of humor.
Table 6: Examples from Subtask 2 where the top 20 systems collectively either failed or succeeded in recognizing the funnier headline. On the overall dataset, these were the extreme headline pairs, having either the largest or the smallest differences in funniness between their headlines. We also report the GloVe word vector distances, mapped to the range 0-2, between the replaced and replacement words.
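The incongruity measure can be sketched as below, taking the correlation to be a Pearson correlation (an assumption, since the text only says "correlation") and using toy vectors in place of real GloVe embeddings. The function would be applied separately to the set of headline pairs in each bin of the curve.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity); ranges from 0 to 2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def incongruity_measure(pairs, labels):
    """Correlate the incongruity difference with the ground truth label.

    pairs: list of (orig_vec, edit1_vec, edit2_vec) word vectors.
    labels: ground truth label per pair (1 or 2, the funnier headline).
    """
    diffs = [cosine_distance(orig, e2) - cosine_distance(orig, e1)
             for orig, e1, e2 in pairs]
    return pearson(diffs, labels)
```

A positive value means the funnier headline tends to be the one whose replacement word is farther from the original word; a negative value means the opposite.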

Funniness Gaps
Next, we inspect whether the funniness difference between the two headlines in a pair affects classification accuracy. We calculate the mean absolute funniness difference between the headline pairs within each of the N bins of systems that correctly classified them, as shown in Figure 4(b). For example, the funniness difference between the two headlines in the pairs, which were correctly classified by all 20 systems, was around 0.8 on average. The rising trend in the curve suggests that, in general, more systems are able to correctly classify headline pairs having larger differences in humor. This helps confirm the annotation quality in the dataset, showing that humans and machines both agree on the intensity of humor in the dataset, and both can distinguish between slight humor and extreme humor. Recall also from Section 5.1 that most of the systems for Subtask 2 were simply applying the systems from Subtask 1 to find the funnier of the two headlines by comparing their funniness scores. Pairs with widely different funniness would less likely have overlapping uncertainty, leading to more accurate pairwise rankings.
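The funniness-gap analysis groups headline pairs by how many systems classified them correctly and averages the absolute funniness difference within each group. The sketch below assumes a hypothetical `correct_matrix` with one row per system and one boolean per pair.

```python
from collections import defaultdict


def mean_gap_by_num_correct(correct_matrix, f1, f2):
    """Mean absolute funniness difference of headline pairs, grouped by
    the number of systems that classified each pair correctly.

    correct_matrix[s][i] is True if system s got pair i right;
    f1[i] and f2[i] are the ground truth funniness of the two headlines.
    """
    gaps = defaultdict(list)
    for column, a, b in zip(zip(*correct_matrix), f1, f2):
        gaps[sum(column)].append(abs(a - b))
    return {n: sum(g) / len(g) for n, g in sorted(gaps.items())}
```

Groups with no pairs are simply absent from the result rather than reported as zero.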

Extreme Examples
We discuss the collective top 20 system performance on edge case examples, with references to Table 6:
• C1: Among all the test examples which were correctly classified by the 20 systems, C1 has the largest funniness difference between its pair of headlines. "Secret service" and "secret police" are quite natural in text and substituting one with the other barely changes the headline's meaning. However, using "secret santa" clearly raises the surprise. All classifiers were able to assess this relatively easy example.
• C2: This is the example with the largest funniness difference which all 20 systems incorrectly classified. This could be because "puppies" is semantically more distant from "billions" than "pennies" (according to GloVe). Although both headline substitutions are funny and incongruous, the antonym effect of the "pennies" version triggers a further sarcastic humor, since "pennies" is numerically much less than the original word "billions", but still in the category of money. Lacking world knowledge of this numerical difference, the systems award the more incongruous "puppies" the higher ranking. As mentioned in 6.2.1, these systems are especially sensitive to general incongruity as a source of humor and they are likely less aware of other causes of humor, such as meaning reversal.
• C3: This example has the smallest funniness difference of the sentences that were correctly classified by all 20 systems. Its less funny headline is sarcastic and most likely all classifiers were unable to recognize sarcasm and thus correctly chose the other headline as the funnier. If this is true, then ignorance about sarcasm was a lucky benefit in this case.
• C4: This was one of the examples with the smallest funniness differences which was misclassified by all systems. Both its headlines are quite funny and they are similar as they both discuss cleaning spaces. However, all systems found bedroom cleaning as a funnier reference than floor cleaning, likely because floor cleaning occurs much more frequently in our day-to-day conversations, making bedroom cleaning a more incongruous substitution to the classifiers, as indicated by the semantic distances in Table 6.

Quirks of the Dataset
It is challenging to effectively construct a dataset that depends on human creativity, such as humor. Not only does generating high quality humor require more effort from humans, making the process expensive, but reliably assessing the level of humor is also challenging, as humor understanding is subjective. Although we carefully annotated our dataset, we have observed some quirks. Some of our headlines showed a lack of sufficient agreement between judges. For example, in the headline C2 in Table 6, the standard deviation in judges' ratings for the "puppies" version (σ = 0.9) was much higher than that in the "pennies" version (σ = 0.4), implying that using more judges for the "puppies" version could have given it a more reliable funniness rating. However, ensuring such quality control would make the data collection process more expensive.
Additionally, some participating teams reported the frequent mention of President Trump in the dataset, and noted that a non-trivial number of headlines mentioned both "Trump" and "hair" and had received high humor scores, adding certain biases to the data.
Although the FunLines training data was useful, it was annotated using a different set of judges. It is reasonable to expect that the rating scales of FunLines and our task dataset are not calibrated, and a proper calibration could have possibly increased the value of the FunLines data. However, we have not seen any participating system trying to address this problem, for example, by using a standardization technique to unify the two funniness scales.
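One simple way such a standardization could look is a z-score mapping of one rating scale onto the mean and spread of the other. This is only an illustrative sketch: no participating system is known to have used it, the function names are hypothetical, and whether matching the first two moments is an adequate calibration for these datasets is untested.

```python
from statistics import mean, pstdev


def standardize(ratings):
    """Convert ratings to z-scores (zero mean, unit standard deviation)."""
    mu, sigma = mean(ratings), pstdev(ratings)
    if sigma == 0:
        raise ValueError("ratings have zero variance")
    return [(r - mu) / sigma for r in ratings]


def map_to_reference(source, reference, low=0.0, high=3.0):
    """Rescale source ratings to the mean and spread of the reference
    ratings, clipping to the 0-3 funniness scale."""
    mu_ref, sigma_ref = mean(reference), pstdev(reference)
    return [min(high, max(low, z * sigma_ref + mu_ref))
            for z in standardize(source)]
```

For instance, `map_to_reference(funlines_ratings, task_ratings)` would shift and scale hypothetical FunLines ratings so that their distribution resembles that of the task's training ratings before the two sets are merged.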

Conclusion and Future Perspectives
We provided 15,095 edited and humor-rated, potentially funny headlines and defined subtasks for (1) rating the funniness of each one and (2) determining the funnier headline from a pair that came from editing the same original headline. Both humor subtasks were popular, attracting 48 and 31 teams respectively, showing that shared tasks can unify the relatively small humor research community.
For both subtasks, the highly rated solutions show that pre-trained language models work well for rating humor. For Subtask 2, nearly all the participating teams used their solution from Subtask 1 for ranking the two headlines. For Subtask 2, we found that larger disparities in ground truth funniness made ranking easier and that incongruity in a headline was positively correlated with more accurate ranking of humor. For Subtask 1, we discovered that, over the range of funniness scores, the top systems were most accurate at rating humor near the middle of the funniness range where we had the most training data.
For future contests like this, we advocate for more uniformly labeled humor data, though that can be hard and expensive to collect. Another direction worth pursuing is humor recognition in a closed setting such as reading comprehension, where both annotators and systems make judgments based only on a limited amount of provided contextual information. This would constrain the problem, setting a well-defined scope, and potentially lead to stronger annotator agreements.
We also believe that focusing on specific labeled forms of humor, such as incongruity, sarcasm, irony, puns, and superiority would be advantageous. This could help to better understand how different modeling strategies can identify different root causes of humor. We would also want to design Subtask 2 to be more independent of Subtask 1 to encourage fresh approaches for Subtask 2. Finally, improving the common sense and world knowledge understanding capabilities of AI systems will be crucial for substantially improving the performance of computational humor systems. We hope that both the current results and the dataset in this task provide a stepping stone towards this goal.