Smash at SemEval-2020 Task 7: Optimizing the Hyperparameters of ERNIE 2.0 for Humor Ranking and Rating

The use of pre-trained language models such as BERT and ULMFiT has become increasingly popular in shared tasks, due to their powerful language modelling capabilities. Our entry to SemEval uses ERNIE 2.0, a language model which is pre-trained on a large number of tasks to enrich the semantic and syntactic information learned. ERNIE’s knowledge masking pre-training task is a unique method for learning about named entities, and we hypothesise that it may be of use in a dataset which is built on news headlines and which contains many named entities. We optimize the hyperparameters of a regression model and a classification model, and find that the hyperparameters we selected yielded larger gains for the classification model than for the regression model.


Introduction
Verbal humor uses a variety of linguistic features, such as synonymy, wordplay, and phonological similarities, as well as non-linguistic features like world knowledge, to produce a comic effect. That such a broad set of skills is required to understand humor has led several researchers to deem computational humor an AI-complete problem (Binsted et al., 2006). There is a relatively longstanding body of research into humor detection in limited domains, such as knock-knock jokes (Taylor and Mazlack, 2004), one-liners (Mihalcea and Strapparava, 2006) and humorous news articles from the satirical news publication The Onion (Mihalcea and Pulman, 2007). However, shared tasks have attracted more attention and interest to the field since 2017. While previous challenges have focused on collecting Twitter data (Potash et al., 2017; Castro et al., 2018), SemEval 2020 (Hossain et al., 2020) took an original approach and generated the data by collecting news headlines and then asking annotators to edit one word in the headline to make it humorous (Hossain et al., 2019). These headlines emulate those of The Onion. The edits shown below indicate the location of the substitution and the word to be inserted. The edited headlines were then rated for humor by subsequent annotators. Sub-task A was to predict the mean funniness score of the edited headline. In sub-task B, the systems saw two edits of the same headline, and predicted which one had achieved the higher mean funniness score.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Previous Work
Excluding work on puns, there have been three humor detection shared tasks in recent years: SemEval 2017 (Potash et al., 2017), HAHA 2018 (Castro et al., 2018) and HAHA 2019 (Chiruzzo et al., 2019). As the tasks and data have varied between them, direct comparison is not possible. However, a comparison of approaches to the tasks shows some interesting trends.
SemEval 2017's entries were evenly divided between feature engineering approaches and deep learning systems, with both achieving competitive results. The highest-ranking team in the official results for task A, SVNIT (Mahajan and Zaveri, 2017), used an SVM with incongruity, ambiguity and stylistic features. The second highest-ranking team, Datastories (Baziotis et al., 2017), opted for a Siamese bi-LSTM with attention. Interestingly, a remarkably simple system prevailed in task B: Duluth (Yan and Pedersen, 2017) used the probability assigned to the text by a bigram language model, instead of the output of a classifier, to make predictions.
Entries to HAHA 2018 were divided along similar lines. The winning system used Naive Bayes and ridge regression models optimized with an evolutionary algorithm (Ortiz-Bejar et al., 2018), while the runner-up used a bi-LSTM with attention (Ortega-Bueno et al., 2018).
HAHA 2019 saw a sea change towards the use of transfer learning models, such as BERT (Devlin et al., 2018) and ULMFiT (Howard and Ruder, 2018). These models leverage large amounts of data and transformer attention models to learn contextual relations between words. Adilism (Ismailov, 2019) used multilingual BERT base uncased and extended the language model training without labels, before finetuning their system with the dataset labels. The second place system used an ensemble of a BERT model and ULMFiT, with Naive Bayes and SVM classifiers. The majority of the top entries to this task used BERT in some way, although one noted that it did not improve performance as expected (Ortega-Bueno et al., 2019).
As BERT models are trained on a masked language modelling task and a sentence prediction task, they capture mainly word-level and sentence-level information. By comparison, ERNIE 2.0 (Sun et al., 2020), henceforth ERNIE, aims to capture more lexical, syntactic and semantic information in corpora by training on eight different tasks in a continual pre-training framework. Knowledge masking features among these eight tasks, and is implemented by masking a phrase or entity as a single unit, instead of masking its constituent words independently. The distinction between how BERT and ERNIE learn can be illustrated with the following sentence: "Harry Potter is a series of fantasy novels written by J. K. Rowling". BERT captures co-occurrence information of 'J' with 'K' and 'Rowling'; however, it does not capture information about the entity J. K. Rowling. By modelling this entity as a single unit, ERNIE is claimed to be capable of extrapolating the relationship between Harry Potter and J. K. Rowling (Sun et al., 2019). Furthermore, ERNIE is trained on a wide variety of domains, including encyclopedias and news articles, giving the model extensive knowledge of named entities. This is potentially of great benefit on the Funlines dataset, which is built on news headlines and therefore features a large number of named entities, particularly politicians. For instance, it may help the model to infer the relationship between Mitch McConnell and Trump in the example from Table 1.
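The difference between masking single words and masking a whole entity can be sketched in a few lines. This is only an illustration of the idea: the whitespace tokenization, the hand-picked positions and the helper name `mask_tokens` are assumptions made for the example, not the procedure used in ERNIE's pre-training.

```python
def mask_tokens(tokens, positions, mask="[MASK]"):
    """Return a copy of tokens with the given positions replaced by the mask."""
    return [mask if i in positions else tok for i, tok in enumerate(tokens)]


sentence = "harry potter is a series of fantasy novels written by j. k. rowling"
tokens = sentence.split()

# Word-level masking: a single word is hidden, so the rest of the name
# ("j." and "rowling") remains visible as context.
word_masked = mask_tokens(tokens, {11})

# Entity-level masking: the whole span of the entity is hidden at once, so the
# model has to recover it from the rest of the sentence.
entity_masked = mask_tokens(tokens, set(range(10, 13)))

print(" ".join(word_masked))
print(" ".join(entity_masked))
```

In the second case the only remaining clues to the author are "harry potter" and "fantasy novels", which is the kind of relationship the entity-level objective is meant to encourage the model to learn.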

Text Preprocessing
The dataset featured the original headline, with the word which had been replaced marked in angle brackets, and the substitute word given separately. We rendered the edited headlines by inserting the substitute word into the sentence in place of the word in angle brackets. As a result, our model did not have access to the replaced word, or to the original headlines.
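A minimal sketch of this rendering step is given below. It assumes the replaced word is marked as `<word/>` inside the original headline; the exact markup, the function name `render_edited` and the example headline are illustrative assumptions rather than details taken from our pipeline.

```python
import re

# Assumption: the word to be replaced appears as <word/> in the original headline.
EDIT_SLOT = re.compile(r"<([^<>/]+)/>")


def render_edited(original, substitute):
    """Return the edited headline, with the marked word replaced by the substitute."""
    # A function is used as the replacement so that backslashes in the
    # substitute word are not interpreted as regex escapes.
    rendered, count = EDIT_SLOT.subn(lambda match: substitute, original, count=1)
    if count == 0:
        raise ValueError("no marked word found in the headline")
    return rendered


# Invented example headline, for illustration only.
print(render_edited("Senate passes <budget/> bill after long debate", "pizza"))
```

Only the returned string is passed on to the model, so neither the marked word nor the original headline is visible to it.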
For ERNIE models, we preprocessed the data as follows. We lowercased the texts and tokenized them into word pieces, using a greedy longest-match-first algorithm over the vocabulary. As is conventional for ERNIE, we then added a [CLS] token to the start of each text and a [SEP] token to the end of each text; for pairs of texts (e.g. task 2), an additional [SEP] replaced the [CLS] of the second text. We also padded sequences to a maximum length of 128.
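The preprocessing steps can be sketched as follows. This is a simplified illustration, not the tokenizer shipped with ERNIE: the toy vocabulary, the whitespace pre-splitting, the `[UNK]` and `[PAD]` token names and the function names are all assumptions made for the example, and truncation is left out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, text_pair=None, max_len=128):
    """Lowercase, tokenize, add [CLS]/[SEP] and pad to max_len."""
    def pieces(t):
        return [p for w in t.lower().split() for p in wordpiece(w, vocab)]

    tokens = ["[CLS]"] + pieces(text) + ["[SEP]"]
    if text_pair is not None:
        # The second text gets no [CLS]; it is only closed by a further [SEP].
        tokens += pieces(text_pair) + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncation is not sketched here")
    return tokens + ["[PAD]"] * (max_len - len(tokens))


# Toy vocabulary, for illustration only.
VOCAB = {"a", "fun", "##ni", "##er", "headline"}
print(encode("A funnier headline", VOCAB, max_len=10))
```

For task 2, the two edited headlines would be passed as `text` and `text_pair`, giving a single sequence of the form [CLS] first headline [SEP] second headline [SEP].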

Baseline
For task 1, we created two baselines using scikit-learn (Pedregosa et al., 2011): one which predicted a constant value, and one which predicted the mean value. For task 2, we created three baselines. In the first, we always predicted the same label. The second baseline was a trigram language model built with KenLM (Heafield, 2011) on a dataset containing around 200,000 news headlines from 2012-2018 editions of the Huffington Post. Similarly to the approach taken by the Duluth team (Yan and Pedersen, 2017) in SemEval 2017, we reasoned that the funnier of the two headlines would be the less similar to real news headlines, so we selected the sentence that had the lower log probability according to the model. However, this performed worse than the first baseline.
The third baseline was a trigram model built on the headlines labelled as sarcastic in a sarcastic news dataset (Misra and Arora, 2019). These headlines came from The Onion, which the competition dataset seeks to emulate. Here we reasoned that the funnier headline would have the higher log probability under this language model. Predicting labels in this way was an improvement over the other two baselines, suggesting that the unique data generation methods in this challenge succeeded in emulating satirical headlines in some way.
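The decision rule shared by the two language model baselines can be sketched with a small pure-Python trigram model. This is a stand-in for illustration only: our baselines were built with KenLM, whereas the add-one smoothing, the class `TrigramLM`, the function `pick_funnier` and the tiny corpora below are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


class TrigramLM:
    """Toy trigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.trigrams, self.contexts = Counter(), Counter()
        self.vocab = {"</s>"}
        for sentence in sentences:
            toks = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
            self.vocab.update(toks[2:])
            for i in range(2, len(toks)):
                self.trigrams[tuple(toks[i - 2:i + 1])] += 1
                self.contexts[tuple(toks[i - 2:i])] += 1

    def log_prob(self, sentence):
        """Base-10 log probability of the whole sentence."""
        toks = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for i in range(2, len(toks)):
            numerator = self.trigrams[tuple(toks[i - 2:i + 1])] + 1
            denominator = self.contexts[tuple(toks[i - 2:i])] + vocab_size
            total += math.log10(numerator / denominator)
        return total


def pick_funnier(lm, headline_1, headline_2, prefer_lower):
    """Return 1 or 2, the headline predicted to be funnier."""
    lp1, lp2 = lm.log_prob(headline_1), lm.log_prob(headline_2)
    if prefer_lower:  # news model: the less news-like headline wins
        return 1 if lp1 < lp2 else 2
    return 1 if lp1 > lp2 else 2  # satire model: the more satire-like headline wins


# Invented miniature corpora, for illustration only.
news_lm = TrigramLM(["senate passes budget bill", "senate passes tax bill",
                     "president signs budget bill"])
satire_lm = TrigramLM(["senate passes pizza bill", "area man eats pizza"])

h1, h2 = "senate passes budget bill", "senate passes pizza bill"
print(pick_funnier(news_lm, h1, h2, prefer_lower=True))
print(pick_funnier(satire_lm, h1, h2, prefer_lower=False))
```

Because the two candidates are edits of the same headline, they usually have the same length, so comparing unnormalized sentence log probabilities is a reasonable simplification here.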

Model Configuration
For the transfer learning models, we used ERNIE base, which has 12 layers, a hidden size of 768 and 12 self-attention heads. We used a maximum sequence length of 128, a dropout probability of 0.1 and the Adam optimizer. To finetune for task 1, we added a fully connected layer with mean squared error as the loss function. For task 2, after the fully connected layer, we added a softmax layer and used cross entropy as the loss function.
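The two finetuning heads differ only in their output size and loss, which the following dependency-free sketch illustrates. It assumes the encoder yields one pooled vector per input; the three-dimensional toy vector (standing in for the 768-dimensional one), the weights and the function names are invented for the example, and no training loop is shown.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def softmax(logits):
    peak = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    return -math.log(probs[label])


pooled = [0.5, -1.0, 2.0]  # toy stand-in for the pooled encoder output

# Task 1 (regression): a single output unit, scored with mean squared error.
score = linear(pooled, [[0.2, 0.1, 0.3]], [0.1])
regression_loss = mean_squared_error(score, [1.2])

# Task 2 (classification): two output units, softmax, then cross entropy.
logits = linear(pooled, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
probs = softmax(logits)
classification_loss = cross_entropy(probs, 1)

print(regression_loss, classification_loss)
```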

Experiments
We experimented with optimizing three hyperparameters: learning rate (0.000001, 0.0001 or 0.001), batch size (16, 32 or 64) and number of epochs (3, 4 or 5). For the sake of brevity, we report only the three highest and three lowest results for each task. The results reported are the mean of 5 runs, with standard deviation in parentheses. We noticed remarkably little variation in the task 1 results, regardless of the hyperparameter settings. Given that the same learning rate is observed in both high- and low-scoring systems, and that there is no observable pattern in terms of batch size, this suggests that another hyperparameter or variable may help to achieve better results. By contrast, in task 2 we saw much more variation, with a jump of almost 11% from the lowest- to the highest-scoring configuration. A small learning rate of 0.0001, along with a relatively large batch size of 64, featured in all three top results, and the number of epochs was decisive, bringing a 5% increase at the optimal number, 4. We observed that the lowest learning rate also achieved the lowest scores: with too small a learning rate, the network appears not to converge, and varying the other hyperparameters does not change this.
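The search over the 27 configurations, with 5 runs each, can be sketched as below. The function `train_and_evaluate` is a hypothetical placeholder that returns a reproducible pseudo-random score instead of finetuning a model, and the use of the sample standard deviation is an assumption.

```python
import itertools
import random
import statistics

LEARNING_RATES = [1e-6, 1e-4, 1e-3]
BATCH_SIZES = [16, 32, 64]
EPOCHS = [3, 4, 5]
N_RUNS = 5


def train_and_evaluate(lr, batch_size, epochs, seed):
    """Hypothetical stand-in for one finetuning run; returns a fake score."""
    rng = random.Random(f"{lr}-{batch_size}-{epochs}-{seed}")
    return rng.random()


def grid_search(evaluate=train_and_evaluate):
    """Score every configuration over N_RUNS seeds, best mean first."""
    results = []
    for lr, batch_size, epochs in itertools.product(LEARNING_RATES, BATCH_SIZES, EPOCHS):
        scores = [evaluate(lr, batch_size, epochs, seed) for seed in range(N_RUNS)]
        results.append({
            "lr": lr,
            "batch_size": batch_size,
            "epochs": epochs,
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores),  # sample standard deviation
        })
    return sorted(results, key=lambda r: r["mean"], reverse=True)


results = grid_search()
# Report the three highest- and three lowest-scoring configurations.
for r in results[:3] + results[-3:]:
    print(f"lr={r['lr']:g} batch={r['batch_size']} epochs={r['epochs']} "
          f"{r['mean']:.3f} ({r['std']:.3f})")
```

Sorting in descending order suits a metric where higher is better, such as accuracy; for an error metric the order would be reversed.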

Conclusion
While transfer learning models have achieved very impressive results on a variety of NLP tasks, the performance on this humor task was not as high as anticipated. Perhaps in a multi-task learning setup, we might have seen better performance. Nonetheless, our work demonstrates the importance of optimizing the hyperparameters of the finetuning layers, which achieved improvements on both tasks, but particularly on the classification task.