YNU-HPCC at SemEval-2020 Task 7: Using an Ensemble BiGRU Model to Evaluate the Humor of Edited News Titles

This paper describes an ensemble model designed for SemEval-2020 Task 7. The task is based on the Humicroedit dataset, which comprises news titles and one-word substitutions designed to make them humorous. We use BERT, fastText, ELMo, and Word2Vec to encode these titles and then pass them to a bidirectional gated recurrent unit (BiGRU) with attention. Finally, we use XGBoost on the concatenated outputs of the different models to make predictions.


Introduction
The Humicroedit dataset was created for research in computational humor (Hossain et al., 2019). The authors employed Amazon Mechanical Turk annotators to edit a single word of some fifteen thousand news headlines in order to make them funny. Five other annotators then graded each edited title on a scale of 0-3, with 3 being the funniest. Titles, despite being very short (the longest were around twenty words), convey a lot of information, and one-word edits are a way to make minimal changes for an expressed purpose. These two factors make the Humicroedit dataset a useful tool for studying computational humor.
During the competition, the organizers published the Funlines dataset, which improved on Humicroedit in terms of cost per title by gamifying the process: they provided cash rewards for the best annotators (Hossain et al., 2020b). The performance of annotators was measured on a mix of how funny their edits were and how well their grades tracked the overall average. Players were able to improve in both categories and the authors posit that this was due to the availability of live feedback. On average, the Funlines dataset is funnier than Humicroedit and there was better agreement on the grades. The Funlines dataset was made available to participants of the competition in January.
There were two sub-tasks for SemEval Task 7: 1) for each headline and its edit, predict the mean of the grades assigned by the five annotators; 2) decide which of two title-edit pairs is funnier. We use a few popular pretrained language models in a BiGRU ensemble for the first task, and we use those results for the second task.
The rest of the paper is organized as follows. Section 2 presents a brief review of humor assessment and some NLP tools. Section 3 presents the BiGRU ensemble model in detail. Section 4 contains some discussion of the results, and Section 5 is a short conclusion.

Background
Humor presents challenges that other NLP tasks, such as entailment and analogy, do not: it is highly contextual and subjective. Because of the latter, we may prefer to ask how funny something is, to whom, and when, rather than a question like "Is this funny? Yes or no". Attempts at computational humor therefore lend themselves naturally to regression tasks.
Continuous representations have proven to be a good basis from which to transform a natural language task into a computational one. In order to produce continuous representations, many researchers have relied on the distributional hypothesis, which states that the meaning of a word is the assortment of contexts in which it appears. Here we briefly describe the four language models for producing continuous representations that we use for Task 7. Word2Vec trains a shallow neural network to predict a word from its context (the continuous bag-of-words model) or the context from a word (the skip-gram model) (Mikolov et al., 2013a; Mikolov et al., 2013b). FastText does something similar but trains on character n-grams, which are averaged to get word vectors (Bojanowski et al., 2017). This allows fastText to perform better than Word2Vec on out-of-vocabulary (OOV) words.
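The character n-gram idea behind fastText can be illustrated with a short sketch (the helper below is our own illustration, not part of the fastText API): a word is wrapped in boundary markers and decomposed into n-grams, and an OOV word's vector would then be the average of its n-gram vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with fastText-style boundary markers."""
    w = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams
```

Because an unseen word shares most of its n-grams with seen words, its representation is still informative.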
ELMo trains continuous representations with a bidirectional LSTM: a forward LSTM predicts the next token in a sequence and a backward LSTM predicts the previous one (Peters et al., 2018). The layer representations are then combined linearly, creating context-dependent word representations.
BERT uses a multi-layer, attention-based encoder called a Transformer, which can model relationships between tokens that are farther apart than LSTMs can (Devlin et al., 2019). During pretraining, BERT masks about 15% of tokens (sub-word units of variable length) and also asks the model to predict whether two sentences are sequential. BERT allows users to select vectors from specific layers of the transformer and combine them in various ways. Another feature, and one that is relevant here, is that the user can choose not to pool the vectors at all, which results in word vectors.
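The token-selection step of the masking scheme can be sketched as follows (a simplified illustration of our own; real BERT replaces only 80% of the selected tokens with [MASK], swapping 10% for random tokens and leaving 10% unchanged):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly select ~15% of positions and replace them with [MASK].
    Simplified relative to BERT, which masks only 80% of the selected
    positions and perturbs or keeps the remainder."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, sorted(positions)
```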

System Overview
As shown in Figure 1, the system is a stacked bidirectional GRU ensemble. The first level uses cross-validation on 11 different embeddings. The predictions of the first level are passed to an XGBoost regressor to produce the final output. The results for sub-task 1 were used directly to decide which of the two headlines was funnier for sub-task 2, so the following system is designed simply to predict how funny the annotators thought the edited headlines were.

Data and Encoding
We regret that we were unable to improve performance using the Funlines data. In retrospect, the humor ratings have different distributions in the two datasets, and it is possible that some kind of calibration was in order (Hossain et al., 2020a). We also left out the replaced word from the original title and instead encoded only the edited titles with four off-the-shelf pretrained models.
Embedding-as-Service was used to get fastText and Word2Vec vectors for the titles. The fastText vectors came from models trained on Wiki News and Common Crawl, both with dimension 300. Word2Vec was trained on Google News, again with 300-dimensional vectors. These were very straightforward and performed consistently on our task. For BERT, Huggingface and Han Xiao's Bert-as-Service were very helpful for quick encoding. Additionally, we experimented with a Kaggle dataset called "All the News", whose articles were published shortly before the articles corresponding to the news headlines in Humicroedit. We expected that starting from a BERT checkpoint and continuing pretraining on this news data would improve performance, but initial tests were not promising, so we dropped the idea. The best BERT performance was achieved using no pooling, which gives something like word vectors. The newer models with Whole-Word-Masking did not do as well as the base and large models, so we left them out. On our holdout set, it was unclear whether cased was better than uncased, so we used both. Finally, we also tried fine-tuning various BERT models on Humicroedit, but the results were not as good as those from the Keras BiGRU. The last language model we used was ELMo, for which we took advantage of the relevant TensorFlow Hub module. The LSTM1 output was consistently better than the rest, so it was used exclusively.
The PowerTransformer and StandardScaler from scikit-learn were used to normalize the data. While the latter was a bit better for individual models, they performed equally in the ensemble. For submission we used the Box-Cox method of the PowerTransformer.
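For readers unfamiliar with the transform, a minimal NumPy sketch of a Box-Cox transform followed by standardization is below (with a hand-supplied lambda for illustration; scikit-learn's PowerTransformer additionally estimates lambda by maximum likelihood and handles the inverse transform):

```python
import numpy as np

def boxcox_standardize(x, lam):
    """Box-Cox transform with a fixed lambda (input must be strictly
    positive), followed by zero-mean / unit-variance scaling."""
    x = np.asarray(x, dtype=float)
    assert np.all(x > 0), "Box-Cox requires strictly positive values"
    t = np.log(x) if lam == 0 else (x ** lam - 1.0) / lam
    return (t - t.mean()) / t.std()
```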

Bidirectional GRU
Broadly, the first level consists of a bidirectional GRU, some form of attention, a dense layer with ReLU activation and 50-dimensional output, and another dense layer with 1-dimensional output and hyperbolic tangent activation. Table 1 shows some of the relevant parameters. Two different forms of attention were used. For BERT and ELMo, attention vectors were calculated from the matrix output of the BiGRU. For fastText and Word2Vec, it worked better to apply global average pooling to the output of the BiGRU and use the result as a context vector for attention over the matrix output of the BiGRU.
The loss functions also differed: logcosh for fastText and Word2Vec versus mean squared error for BERT and ELMo. Both setups used the stock Adam optimizer provided by Keras.
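The second attention variant, which uses global average pooling to form the context vector, can be sketched in NumPy (a simplified, weight-free version of our own; the actual model includes learned layers around this pooling step):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gap_context_attention(H):
    """H: (timesteps, features) matrix output of the BiGRU.
    Global average pooling over time gives a context vector;
    attention weights are the softmax of each timestep's
    similarity to that context, and the result is the weighted
    sum of timesteps."""
    c = H.mean(axis=0)    # GAP context vector
    scores = H @ c        # similarity of each timestep to the context
    alpha = softmax(scores)
    return alpha @ H      # (features,) pooled representation
```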
We used Keras to try an LSTM, a CNN, a CNN-LSTM, a multi-channel CNN, a GRU, and a capsule network. We found that BiGRUs had the highest individual model performance and that adding any other model reduced ensemble performance.
The hyperbolic tangent activation used on the output layer returns values between -1 and 1, so when we applied the inverse power transform we were not using the full range (which was about -2.5 to 2.5 after scaling with the power transformation), but we found that re-scaling the output reduced performance.

Ensemble Learning
We ran cross-validation on each embedding six times and saved the results from the model that performed the best. Training models on all eleven embeddings using 15-fold cross-validation, and doing each six times, resulted in a total runtime of about four hours on an Nvidia 2060 SUPER.

Table 3: The top section has the individual model results, including both bucket and overall performance. Heavy preprocessing removes punctuation; light does not. The bottom section has ensemble methods and the baseline. This is a different run than was submitted, so the scores are slightly different, but the models presented here are the same as used in the submission that placed third. The simple average just averages the predictions of the models. The baseline predicts the mean of the train set every time.
For each model, we concatenated the predictions on the folds left out of training. We then concatenated these predictions from each model and fed the result to a few XGBoost regressors. We tried AdaBoost, Random Forest, and SVR, but XGBoost was consistently the best. Still, different hyperparameters sometimes performed better, so we kept the four best-performing hyperparameter settings and, at run time, selected the one with the highest score on the dev set to make predictions on the test set. Table 2 shows the parameters of the regressor that made the best predictions in the ensemble used in the competition.
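The out-of-fold construction can be sketched as follows (a NumPy-only illustration with stand-in base models; the real system used the 11 BiGRU models, 15 folds, and XGBoost as the meta-learner):

```python
import numpy as np

def oof_predictions(models, X, y, k=5, seed=0):
    """Out-of-fold predictions for stacking. Each model is a
    (fit, predict) pair of callables standing in for one first-level
    network: fit returns parameters, predict applies them. Every
    example is predicted by a model that never saw it in training."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    oof = np.zeros((len(y), len(models)))
    for j, (fit, predict) in enumerate(models):
        for f in folds:
            train = np.setdiff1d(idx, f)
            params = fit(X[train], y[train])
            oof[f, j] = predict(params, X[f])
    return oof  # one column per base model, fed to the meta-learner
```

The matrix returned here plays the role of the concatenated first-level predictions described above.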

Experimental Setup
For preprocessing, we removed strings like " -The New York Times" or " -live updates" that sometimes appeared in the titles. This was all the preprocessing we did for about half of the models, and it resulted in a title length of 30 tokens. For the other half, we also removed all punctuation, resulting in 21-token titles. We experimented with this only a little, and perhaps left something on the table here.
We sorted the data in the train set so that each fold in cross validation had a distribution of y values that was representative of the whole train set. This seemed to make the scores more consistent over different runs, but did not noticeably improve them.
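This sorting scheme can be sketched as follows (our reconstruction of the idea, not the exact code used): sort indices by target value and deal them out to folds round-robin, so each fold spans the full range of grades.

```python
import numpy as np

def stratified_regression_folds(y, k):
    """Assign fold ids so each fold sees a representative
    distribution of y: sort indices by target value and deal
    them out round-robin, like dealing cards."""
    order = np.argsort(y)
    fold_of = np.empty(len(y), dtype=int)
    fold_of[order] = np.arange(len(y)) % k
    return fold_of
```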
It may be worth noting that due to the competition rule that only the last submission would be counted, we never actually use all the available data for training. We didn't want to rely on a model whose performance was not verified on a holdout of the train set.
The organizers graded the first sub-task using root mean squared error, and that is what we used throughout testing. A bucket performance metric was also somewhat useful during development: it independently scores the bottom 10% (least funny) and the top 10% (most funny) as the first bucket, the bottom 20% and the top 20% as the second bucket, and so on.
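One level of the bucket metric can be sketched as follows (a hypothetical helper reflecting our description above; a fraction p selects the bottom p and top p of examples, ranked by true grade, and scores each end separately):

```python
import numpy as np

def bucket_rmse(y_true, y_pred, frac):
    """RMSE restricted to the bottom frac (least funny) and top frac
    (most funny) examples, ranked by true grade; each end is scored
    separately."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_true)
    n = max(1, int(len(y_true) * frac))
    lo, hi = order[:n], order[-n:]
    rmse = lambda idx: float(np.sqrt(np.mean((y_true[idx] - y_pred[idx]) ** 2)))
    return rmse(lo), rmse(hi)
```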

Results and Discussion
Our model placed third in both sub-tasks, with an RMSE of 0.51737 for the first and an accuracy of 0.65906 for the second. Over the course of the competition, much of our improvement was due to time-consuming trial and error in composing the ensemble.
It is surprising to us that fastText and Word2Vec have individual model scores on par with the models using BERT encodings. In fact, as Table 3 shows, the best-performing individual model comes from using fastText, and the best model using Word2Vec performed better than the best BERT model, although only by a small margin.
The simple average of the results produced by each model was much better than any individual model, but not as good as XGBoost. The baseline of predicting the mean is actually in the ballpark of some individual models we tried.
To evaluate the utility of this project, it might help to compare any model with human performance, perhaps using the Funlines data, but we reserve this idea for later. Another idea to pursue is to make this into a classification task: humans, especially those who read news, would presumably be pretty good at deciding whether a title has been edited for the purpose of making it funny.

Conclusion
This project used a BiGRU ensemble with XGBoost to predict how funny Amazon Mechanical Turk annotators found edited news headlines. A third-place finish on both sub-tasks of SemEval Task 7 was achieved largely by using a large number of models for encoding the titles and by tuning an ensemble for the task.