Hitachi at SemEval-2020 Task 7: Stacking at Scale with Heterogeneous Language Models for Humor Recognition

This paper describes the winning system for SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. Our strategy is Stacking at Scale (SaS) with heterogeneous pre-trained language models (PLMs) such as BERT and GPT-2. SaS first fine-tunes a large number of PLMs with various hyperparameters and then applies a stacking ensemble on top of the fine-tuned PLMs. Our experimental results show that SaS outperforms a naive average ensemble by leveraging weaker PLMs as well as high-performing ones. Interestingly, the results show that SaS captured non-funny semantics. Consequently, the system was ranked 1st in all subtasks by significant margins over the other systems.


Introduction
The recognition of humor in text has been receiving much attention (Barbieri and Saggion, 2014; Hossain et al., 2019). Accordingly, SemEval-2020 Task 7, Assessing Humor in Edited News Headlines (Hossain et al., 2020a), which aims at automatically recognizing humor in hand-edited news headlines, was held with two subtasks: Subtask 1, which aims at predicting a funny score for an edited news headline, and Subtask 2, which aims at predicting the funnier of two given edited headlines.
In this paper, we pursue humor recognition with a large-scale stacking ensemble (hereafter Stacking at Scale or SaS) that leverages pre-trained language models (PLMs). SaS is based on an ensemble method in which a meta-estimator is trained to predict labels from the outputs of base models, finding the best combinations of the base models (Wolpert, 1992). Hence, there are two steps in SaS: (i) fine-tuning a large number of heterogeneous PLMs, including BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019), Transformer-XL, XLNet, and XLM (Lample and Conneau, 2019), with various hyperparameters to obtain rich and diverse models, and (ii) training a meta-estimator on top of these PLMs.
Our experiments, fusing up to 1750 PLMs in total, indicate that SaS successfully leverages weaker PLMs as well as high-performing PLMs. Consequently, our system is ranked 1st on both subtasks by significant margins over the other systems. Interestingly, analyses show that SaS learned (relatively) non-funny semantics while still struggling to understand the funniest semantics. To the best of our knowledge, this is the first experiment that involves thousands of diverse PLMs, revealing the current strengths and limitations of PLMs in automatic humor recognition. We also provide insights obtained from these analyses.

Background
Humor recognition and related problems have been studied in recent years (Khodak et al., 2018; Barbieri and Saggion, 2014; Reyes et al., 2012). Khodak et al. (2018) introduced a large-scale annotated corpus of sarcasm and provided baseline systems for sarcasm detection. Barbieri and Saggion (2014) broadly investigated features for automatically detecting irony and humor. SemEval-2020 Task 7 (Hossain et al., 2020a) aims at automatically detecting humor in hand-edited news headlines and was introduced by Hossain et al. (2019). We address the problem by combining a large number of PLMs with stacking.

Task Formalization
As described above, Subtask 1 aims at predicting a "funny score," a real value in the range [0, 3] (0 = "Not", 1 = "Slightly", 2 = "Moderately", 3 = "Funny"), for an edited headline. We formalized the task as sentence-pair regression. Subtask 2 aims at predicting the funnier of two edited headlines originating from the same headline. We reuse the Subtask 1 model for it: we estimate the scores of both edited headlines and choose the one with the higher score.

Figure 1: Overview of proposed model

Figure 1 shows an overview of our proposed model architecture. Given a pair of edited and original headlines, we apply a PLM, BiLSTM layers, a dot-product attention layer, a pooling layer, and a feed-forward layer successively to predict funny scores.
Preprocessing: We concatenate the two headlines. Tokenization is conducted by a PLM-specific tokenizer. We surround the edited tokens with two special marking tokens, "<" and ">". We insert special tokens (e.g., [CLS] and [SEP]) as required for each PLM. The implementations are described in detail in Section 6.1.
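The preprocessing step can be illustrated with a minimal sketch. It assumes whitespace-tokenized toy headlines, a BERT-style layout with the edited headline first, and that the edited position is marked in both headlines; these choices, the function names, and the example headlines are illustrative assumptions, not details confirmed by the paper. The actual system relies on PLM-specific tokenizers.

```python
def mark_edit(tokens, start, end, open_mark="<", close_mark=">"):
    """Surround tokens[start:end] (end exclusive) with the marking tokens."""
    return tokens[:start] + [open_mark] + tokens[start:end] + [close_mark] + tokens[end:]


def build_bert_style_input(edited, original, edit_span, orig_span):
    """Concatenate two marked headlines with BERT-style special tokens.

    Assumption: the layout is [CLS] edited [SEP] original [SEP].
    """
    edited_marked = mark_edit(edited, *edit_span)
    original_marked = mark_edit(original, *orig_span)
    return ["[CLS]"] + edited_marked + ["[SEP]"] + original_marked + ["[SEP]"]


# Toy headlines invented for illustration only.
original = "Senate votes on new tax bill".split()
edited = "Senate votes on new cheese bill".split()
tokens = build_bert_style_input(edited, original, edit_span=(4, 5), orig_span=(4, 5))
print(" ".join(tokens))
```

With this layout, the first token of the edited headline segment is "[CLS]" and the first token of the original headline segment is "[SEP]".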

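Intra- and Inter-Headline Encoding
To recognize intra-headline semantics, we first apply a headline-wise multi-layered BiLSTM (Graves et al., 2013) as follows:
$$h^{(\mathrm{BiLSTM})}_{\mathrm{start\_edit}:\mathrm{end\_edit}} = \mathrm{BiLSTM}\big(h^{(\mathrm{PLM})}_{\mathrm{start\_edit}:\mathrm{end\_edit}}\big), \qquad h^{(\mathrm{BiLSTM})}_{\mathrm{start\_origin}:\mathrm{end\_origin}} = \mathrm{BiLSTM}\big(h^{(\mathrm{PLM})}_{\mathrm{start\_origin}:\mathrm{end\_origin}}\big),$$
where $h^{(\mathrm{PLM})}_i$ / $h^{(\mathrm{BiLSTM})}_i$ are the PLM/BiLSTM representations of the $i$-th token and (start_edit, end_edit) / (start_origin, end_origin) represent the starting/ending positions of the edited/original headlines.
Next, the $h^{(\mathrm{BiLSTM})}_i$ are fed into a global dot-product attention layer to capture inter-headline semantics, producing the final hidden embeddings $h_i$.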

Funny Score Regression
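We employ a headline-wise pooling layer and predict the funny score with a feed-forward network (FFN):
$$\hat{y} = \mathrm{FFN}\big(\mathrm{POOLING}_{\mathrm{PLM}}(h_{\mathrm{start\_edit}:\mathrm{end\_edit}}) \oplus \mathrm{POOLING}_{\mathrm{PLM}}(h_{\mathrm{start\_origin}:\mathrm{end\_origin}})\big),$$
where ⊕ is the concatenation operation and POOLING_PLM is a PLM-specific embedding pooling function. For example, for BERT, it takes the embeddings of the first tokens of the two headlines ("[CLS]" and "[SEP]"). The details are in Table 7 of Appendix B. We trained the model with the mean squared error loss.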
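The headline-wise pooling can be sketched as follows. This is a minimal NumPy illustration that assumes token embeddings are stored as a (tokens × dimensions) array and that spans are end-exclusive; the function names and the toy embeddings are hypothetical. The three pooling modes follow the options listed in Table 7.

```python
import numpy as np


def pool(h, start, end, mode):
    """Pool the embeddings h[start:end] (end exclusive) of one headline."""
    span = h[start:end]
    if mode == "first":
        return span[0]
    if mode == "last":
        return span[-1]
    if mode == "average":
        return span.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")


def pooled_features(h, edit_span, origin_span, mode):
    """Concatenate the pooled embeddings of the edited and original headlines."""
    return np.concatenate([pool(h, *edit_span, mode), pool(h, *origin_span, mode)])


# Toy embeddings: 6 tokens with 2 dimensions each (3 tokens per headline).
h = np.arange(12.0).reshape(6, 2)
features_first = pooled_features(h, (0, 3), (3, 6), "first")
features_last = pooled_features(h, (0, 3), (3, 6), "last")
features_avg = pooled_features(h, (0, 3), (3, 6), "average")
```

The concatenated vector is what a feed-forward regressor would consume to produce the funny score.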

Stacking at Scale
We further propose a large-scale ensemble, called Stacking at Scale (SaS), based on a two-layer stacking ensemble (Wolpert, 1992), where the first-layer models (i.e., base models) are fine-tuned PLMs with different hyperparameter sets, and the second-layer model (i.e., meta-estimator) is another regression model. The meta-estimator can select the best combinations of the base models to produce more robust predictions. Figure 2 shows a schematic view and the algorithm steps of SaS. The key attributes are (i) using heterogeneous PLMs for base models, (ii) generating diverse hyperparameter sets for the base models, and (iii) performing cross-validation (CV) during the whole process. CV is used both to accumulate label-leakage-free predictions over the whole training dataset, which serve as the training data of the meta-estimator, and to measure model performance accurately so that better base models can be selected. Since SaS requires enormous computation, its complexity is discussed in Appendix A.

Figure 2: Simplified example of Stacking at Scale with 3-fold cross-validation. The algorithm panels are Train Base Models, Select Base Models, Train Meta Estimator (Step 1: concatenate the predictions of the |M| selected base models into a matrix with |M| columns; Step 2: train the meta-estimator on this matrix), and Apply Meta Estimator (for each model in M, predict funny scores on the test data).

Base Model Hyperparameter Generation
We pursue diversity among the base models by generating a large number of varied hyperparameter sets. To generate sets with reasonable performance in a relatively small number of trials, we utilize a hyperparameter optimization framework. It seeks the best hyperparameter set by performing an iterative search: (i) suggesting (possibly better) sets given the sets already found and their performances and (ii) measuring the performances of the newly suggested sets (see Train Base Models in Figure 2). The performance of each set is measured by the mean squared error (MSE) averaged over the k validation folds of CV.
Since our purpose here is not only to find the best hyperparameter sets but also to collect diverse sets with reasonable performance, we keep all the sets suggested during the search. After the search, we select the top-performing n sets from each PLM type (see Select Base Models in Figure 2).
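The selection step can be sketched in a few lines. The record layout (keys "plm", "params", and "fold_mse") and all numbers below are hypothetical and serve only to illustrate ranking the kept hyperparameter sets by their CV-averaged MSE and taking the top n per PLM type.

```python
from collections import defaultdict
from statistics import mean


def select_top_n(trials, n):
    """Keep the n hyperparameter sets with the lowest mean CV MSE per PLM type."""
    by_plm = defaultdict(list)
    for trial in trials:
        by_plm[trial["plm"]].append(trial)
    selected = {}
    for plm, plm_trials in by_plm.items():
        # Lower MSE averaged over the validation folds is better.
        ranked = sorted(plm_trials, key=lambda t: mean(t["fold_mse"]))
        selected[plm] = ranked[:n]
    return selected


# Toy search history (values invented for illustration).
trials = [
    {"plm": "bert", "params": {"lr": 1e-5}, "fold_mse": [0.30, 0.32, 0.31]},
    {"plm": "bert", "params": {"lr": 3e-5}, "fold_mse": [0.28, 0.29, 0.30]},
    {"plm": "bert", "params": {"lr": 1e-4}, "fold_mse": [0.40, 0.41, 0.39]},
    {"plm": "roberta", "params": {"lr": 1e-5}, "fold_mse": [0.27, 0.26, 0.28]},
    {"plm": "roberta", "params": {"lr": 5e-5}, "fold_mse": [0.29, 0.30, 0.31]},
]
selected = select_top_n(trials, n=2)
```

Because every suggested set is kept, n can be varied after the search without retraining any base model.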

Meta-Estimator Training and Inference
Since the non-linearity in the dataset could already have been captured by the PLMs, we use simple linear regression models for the meta-estimator. Suppose that the scores for a headline predicted by $N$ base models are $\hat{y}_1, \ldots, \hat{y}_N$. The meta-estimator predicts the final score as
$$\hat{y}_{\mathrm{meta}} = w_0 + \sum_{i=1}^{N} w_i \hat{y}_i, \qquad (1)$$
and learns the weights $w_i$ by minimizing the MSE loss with a regularization term. The input dimensionality N is (# of PLM types × n) because we pick the top n hyperparameter sets for each PLM type. For example, for Train Meta Estimator in Figure 2, |M| = |S| = (# of PLM types × n).
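A linear meta-estimator of this kind can be sketched as follows. This is a minimal NumPy illustration with a closed-form L2-regularized fit and a fixed, hand-picked regularization strength; it is a stand-in for illustration, not the scikit-learn estimators used in the experiments. The synthetic "out-of-fold predictions" and all function names are assumptions.

```python
import numpy as np


def fit_ridge_meta(X, y, beta):
    """Fit y ~ w0 + X @ w with penalty beta * ||w||^2 (intercept unpenalized)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_models = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + beta * np.eye(n_models), Xc.T @ yc)
    w0 = y_mean - x_mean @ w
    return w0, w


def predict_meta(w0, w, X):
    """Apply the learned linear combination to base-model predictions."""
    return w0 + X @ w


rng = np.random.default_rng(0)
y_gold = rng.uniform(0.0, 3.0, size=200)             # synthetic gold funny scores
noise_scales = np.array([0.3, 0.5, 1.0])             # three base models, strong to weak
X_oof = y_gold[:, None] + rng.normal(size=(200, 3)) * noise_scales  # synthetic out-of-fold predictions

w0, w = fit_ridge_meta(X_oof, y_gold, beta=1.0)
mse_meta = np.mean((predict_meta(w0, w, X_oof) - y_gold) ** 2)
mse_avg = np.mean((X_oof.mean(axis=1) - y_gold) ** 2)  # naive average ensemble
```

On this synthetic data the learned weights favor the less noisy base models, whereas the naive average treats all of them equally.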
Overall, to predict funny scores with SaS, we (i) feed a headline pair into (# of PLM types × n × k) base models, obtaining (# of PLM types × n × k) predictions in total, (ii) take the CV-wise average of the predictions, reducing the dimension to (# of PLM types × n), and (iii) feed the CV-averaged predictions into the meta-estimator to get the final prediction (see Apply Meta Estimator in Figure 2).

6 Experiments

6.1 Settings
Offline Performance Measurements: Throughout our experiments, we estimated the performance of the models using the root mean squared error (RMSE) aggregated over the validation data of k=5 fold cross-validation (hereafter RMSE-CV; see footnote 1). Note that the aggregation is done over k=5 different sets of validation data, so we can measure the performance robustly.
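The aggregation behind RMSE-CV can be sketched as below. The sketch assumes that per-fold MSEs are weighted by the number of validation instances in each fold before taking the square root; the function name and the numbers are illustrative.

```python
import math


def rmse_cv(fold_mse, fold_sizes):
    """Instance-weighted RMSE aggregated over the CV validation folds."""
    total = sum(fold_sizes)
    weighted_mse = sum(n * mse for mse, n in zip(fold_mse, fold_sizes)) / total
    return math.sqrt(weighted_mse)


# Toy folds of unequal size (values invented for illustration).
score = rmse_cv(fold_mse=[0.25, 0.36, 0.25], fold_sizes=[100, 100, 200])
equal_folds = rmse_cv(fold_mse=[0.25, 0.25], fold_sizes=[10, 10])
```

Weighting by fold size makes the aggregate equal to the RMSE computed over all validation instances pooled together.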
Base Models: Table 1 shows the seven employed PLMs. We employed Optuna (Akiba et al., 2019) as the hyperparameter optimization framework. We generated 50 hyperparameter sets for each PLM. Therefore, in total, 350 models (= 7 types of PLMs × 50 sets of hyperparameters), or 1750 models including the CV variants (× 5 CV folds), were built for the experiments. Details on the hyperparameters are given in Appendix B. We tried many choices of n ranging from 1 to 50 to minimize the RMSE-CV.
Meta-Estimators: Two types of meta-estimators were employed: (i) Lasso regression (Tibshirani, 1996), i.e., linear regression with the L1 regularization term $\beta \lVert w \rVert_1$, and (ii) Ridge regression (Hoerl and Kennard, 1970), i.e., linear regression with the L2 regularization term $\beta \lVert w \rVert^2$. The strength parameter β was chosen from the default search values of scikit-learn (Pedregosa et al., 2011) to minimize the RMSE-CV.
Data: We used Humicroedit (Hossain et al., 2019) and FunLines (Hossain et al., 2020b), which are distributed officially. They have the same data format but slightly different label distributions (Hossain et al., 2020b). The official splits (i.e., train and dev) of Humicroedit are concatenated into a single dataset, on which the cross-validation folds are built.
For training folds, we used both relevant datasets, Humicroedit (Hossain et al., 2019) and FunLines (Hossain et al., 2020b), to maximally capture funny semantics. However, for validation folds, we used only Humicroedit because the test data instances were taken only from Humicroedit and we wanted to approximate the model performance on the test set.
Implementation: We implemented the base models with jiant (Pruksachatkun et al., 2020), a transfer learning framework, which in turn utilizes Hugging Face's Transformers library (Wolf et al., 2019) for its implementations of PLMs, PLM-specific tokens (e.g., "[CLS]" and "[SEP]" for BERT), and PLM-specific tokenizers. We implemented the meta-estimators using scikit-learn (Pedregosa et al., 2011). We employed the RidgeCV and LassoCV functions for Ridge/Lasso regression. Both functions automatically find the best regularization strength β.
Computational Resource: We employed up to 800 Volta (16-GB) GPUs offered by ABCI.

Results and Discussions
Official Ranking: We submitted the SaS-Ridge (n=20) system, i.e., SaS with the Ridge estimator using n=20 hyperparameter sets per PLM type, which performed the best in our pre-submission experiments. The model utilized 700 base models (= 7 types of PLMs × 20 sets of hyperparameters × 5 CV folds). The official ranking presented in Table 2 shows that our system is ranked 1st on both subtasks by significant margins over the other systems. Hereafter, we analyze our system using Subtask 1 since we tuned our systems on it.
How Powerful is SaS?: We show ablation results for each PLM for n = 1 systems in Table 3. Most of the stacking models (shown as "SaS") performed better than single models ("single"), showing the effectiveness of fusing heterogeneous PLMs. Removing a PLM from SaS almost always degrades the performance regardless of the standalone performance of the removed model. This implies that not only the strongest PLMs but also the weaker PLMs are important for SaS. Hereafter, we use the single RoBERTa model, the strongest among the single models, as our baseline. Note that this baseline is competitive since it uses the best hyperparameter set found in the 50-step hyperparameter optimization.

Footnote 1: Let $\mathrm{MSE}_i$ be the MSE and $n_i$ the number of instances for the $i$-th fold's validation data. The aggregated RMSE is $\sqrt{\sum_i n_i \, \mathrm{MSE}_i / \sum_i n_i}$.

Table 1: The seven employed PLMs.
PLM                                   model
BERT (Devlin et al., 2019)            large-uncased
GPT-2 (Radford et al., 2019)          medium / large
RoBERTa (Liu et al., 2019)            large
Transformer-XL                        wt103
XLNet                                 large-cased
XLM (Lample and Conneau, 2019)        en-2048

Figure 5: PLM-wise sum of absolute weights ($\sum_{i \in \mathrm{PLM}} |w_i|$) of the best SaS models, i.e., SaS-Lasso (n=50) and SaS-Ridge (n=20). Models were trained on the training data of our k=5 CV. Values shown are averages over the CV variant models.

Figure 3 shows the change in performance against the total number of base models without CV variants (i.e., 7 types of PLMs × n). SaS-Ridge achieved its best performance at around 100 models, and SaS-Lasso kept its performance high beyond 100 stacked models, while the naive average ensemble got worse. This implies that roughly 100 or more PLMs are required for SaS to achieve its best performance. Also, SaS successfully utilized weaker models without harming the performance, while the naive average ensemble failed to do so. To validate this, we plotted the numbers of active weights (i.e., the number of $w_i$ ($i \geq 1$) in eq. (1) that meet the condition $|w_i| \geq 0.01$) in Figure 4. Since Lasso is a sparse linear model, it consistently activated 80 to 100 PLMs, while Ridge's active weights increased linearly. The result indicates that a sparse model can automatically adjust the number of PLMs to be used.
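Counting active weights amounts to thresholding the absolute values of the learned meta-estimator weights. The sketch below uses the 0.01 threshold from the text; the weight vectors are toy values invented to contrast a sparse (Lasso-like) solution with a dense (Ridge-like) one, and the function name is hypothetical.

```python
def count_active_weights(weights, threshold=0.01):
    """Count weights w_i (i >= 1, intercept excluded) with |w_i| >= threshold."""
    return sum(1 for w in weights if abs(w) >= threshold)


lasso_like = [0.42, 0.0, 0.0, 0.31, 0.005, -0.12]   # toy sparse weights
ridge_like = [0.20, 0.03, -0.02, 0.18, 0.04, 0.09]  # toy dense weights
n_active_lasso = count_active_weights(lasso_like)
n_active_ridge = count_active_weights(ridge_like)
```

In this toy case the sparse vector activates only half of the base models, while the dense vector activates all of them.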
Which Type of PLM is Useful?: We obtained contribution scores for each PLM type via the meta-estimator's weights. Figure 5 shows the PLM-wise sums of absolute weights, $\sum_{i \in \mathrm{PLM}} |w_i|$, where $w_i$ are the weights in eq. (1). RoBERTa and GPT-2 seemed to be the most preferred models, consistent with the results of excluding individual models shown in Table 3 (shown as "w/o"). However, the plot also indicates that the stacking succeeded in leveraging weaker models as well as the best models.

Table 4: Some sample headlines on which our best system, SaS-Lasso (n=50), reduced absolute errors by large margins (top) and by small (or sometimes negative) margins (bottom). Besides headlines, we show gold funny score ("gold"), prediction made by our system ("SaS"), with single RoBERTa ("RoBERTa"), and error reduction over baseline RoBERTa ("reduction").

What Did SaS Solve?: Figure 6 shows the sample distribution over the gold funny scores (top) and the mean absolute error (i.e., $e_{\mathrm{SaS}} = |\hat{y}_{\mathrm{meta}} - y_{\mathrm{gold}}|$) together with the mean absolute error reductions (i.e., $e_{\mathrm{RoBERTa}} - e_{\mathrm{SaS}}$) over the single RoBERTa baseline (bottom). SaS improved performance for not- to slightly-funny ([0-1.5]) headlines, while having similar or degraded performance for the funnier ([1.5-3.0]) headlines. In short, SaS learned (relatively) non-funny semantics. Since these headlines are the majority, SaS also gained overall performance improvements.
Case Study: Why Did SaS Learn Non-Funny Semantics? Table 4 shows sample headlines. The top rows show samples on which our best system, SaS-Lasso (n=50), reduced the errors over the single RoBERTa baseline by large margins.
These headlines had small funny scores, and it seems that their non-funniness can be understood from the headline text itself without much external knowledge; in particular, some of the funniness comes only from the bizarreness or incongruity of the edited headlines. It is natural for PLMs to detect these types of non-funniness because they are trained on large corpora and could have learned to detect the unnaturalness of the given texts. We conjecture that SaS enhanced this ability by combining the heterogeneous PLMs. The bottom rows show headlines with small (or sometimes negative) error reductions. These headlines had large funny scores and seemed to express irony. Irony does not express intentions directly in text and instead relies on a reader's inference using sufficient common sense or background knowledge, especially on current topics. Given that SaS could have chosen the best combination of PLMs and that even SaS had no performance gain for such headlines, it is likely that such knowledge is not contained in any of the PLMs. This suggests a current limitation of PLMs on humor recognition tasks.

Conclusion
In this paper, we proposed a top-performing model for the task of humor recognition. We fused thousands of pre-trained language models by Stacking at Scale. Experimental results showed the strong performance of Stacking at Scale and, at the same time, revealed a current limitation of pre-trained language models. For future work, we will explore injecting common sense or background knowledge into models to understand humor better.

A Time Complexity of Stacking at Scale
In this section, we discuss the time complexity of the Stacking at Scale (SaS) algorithm. We (i) first derive the theoretical time complexity of SaS and (ii) show measurements of the actual running times observed in our experiments, which are consistent with those predicted by the theory.
The discussion is neither fully rigorous nor exhaustive; however, we believe it is enough to offer readers rough estimates of the time complexity of SaS.

A.1 Theoretical Expressions
We estimate the time complexity of SaS relative to that of a single base-model system. The training phase complexity [eq. (2)] and the inference phase complexity [eq. (3)] are derived. In both cases, the dominant term comes from the base-model hyperparameter generation or inference. Thus, the SaS time complexity is (# of base models engaged in a phase) times larger than that of a single base-model system.
We first decompose the SaS algorithm into several steps and derive the time complexity of each step independently. Then, we aggregate the complexities to calculate the overall complexity of SaS.

A.1.1 Base-Model Hyperparameter Generation
Let $\tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})$ be the time needed to train a single base model on the training data $D_{\mathrm{train}}$ with a specific setup (say, a specific PLM type, number of epochs, specific machine resource used, etc.). Let $N_{\mathrm{train}}$ be the number of base models to be trained (i.e., one model per hyperparameter set and CV fold). The time complexity of base-model hyperparameter generation, $T^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})$, is estimated as follows.
$$T^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) = N_{\mathrm{train}} \times \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})$$
Referring to Figure 2, $N_{\mathrm{train}}$ is expressed as
$$N_{\mathrm{train}} = P \times B \times k,$$
where $P$ is the number of PLM types, $B$ the hyperparameter-optimization step budget per PLM type, and $k$ the number of cross-validation folds.

A.1.2 Base Model Inference
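Let $\tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}})$ be the time needed to execute inference with a single base model over the test data $D_{\mathrm{test}}$ with a specific setup. The time complexity of base model inference, $T^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}})$, is estimated as follows.
$$T^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) = N_{\mathrm{infer}} \times \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}})$$
Let $n$ ($\leq B$) be the number of models per PLM type that are engaged in SaS. Then, $N_{\mathrm{infer}}$ is expressed as follows.
$$N_{\mathrm{infer}} = P \times n \times k$$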

A.1.3 Meta-Estimator Training
Since the inputs of the meta-estimators are the predictions over the training data $D_{\mathrm{train}}$ made by the base models, we must execute the inference of the base models over $D_{\mathrm{train}}$ beforehand. Therefore, the time complexity of meta-estimator training, $T^{\mathrm{meta}}_{\mathrm{train}}$, is expressed as
$$T^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}}) = T^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{train}}) + \tau^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}}),$$
where $\tau^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}})$ is the time needed to train a meta-estimator with a specific setup.

A.1.4 Meta-Estimator Inference
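The time complexity of meta-estimator inference, $T^{\mathrm{meta}}_{\mathrm{infer}}$, is expressed as
$$T^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}}) = T^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) + \tau^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}}),$$
where $\tau^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}})$ is the time needed to execute the inference of the meta-estimator with a specific setup.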

A.1.5 Overall SaS Training
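Overall, the time complexity of SaS training, $T^{\mathrm{SaS}}_{\mathrm{train}}(D_{\mathrm{train}})$, is as follows.
$$\begin{aligned}
T^{\mathrm{SaS}}_{\mathrm{train}}(D_{\mathrm{train}}) &= T^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) + T^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}}) \\
&= PBk \times \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) + Pnk \times \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{train}}) + \tau^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}}) \\
&= PBk \times \left[ 1 + \frac{n}{B} \cdot \frac{\tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{train}})}{\tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})} + \frac{1}{PBk} \cdot \frac{\tau^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}})}{\tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})} \right] \times \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})
\end{aligned}$$
In many cases, the second term in the brackets is negligible given that $n/B \leq 1$ and that $\tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{train}}) / \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) \ll 1$ often holds, since (i) in the training phase we iterate over the dataset several times, while in the inference phase we iterate only once, and (ii) in the training phase we need to back-propagate the gradients, while in the inference phase we do not. The third term is negligible when there are many base models to train ($N_{\mathrm{train}} \gg 1$) or the meta-estimator is "lighter" than the base models ($\tau^{\mathrm{meta}}_{\mathrm{train}}(D_{\mathrm{train}}) / \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) \ll 1$); the latter indeed holds for our experiments since the base models are large neural networks, while the meta-estimators are just linear regressions. Thus, $T^{\mathrm{SaS}}_{\mathrm{train}}(D_{\mathrm{train}})$ can be approximated by the first term alone (i.e., base model training) as follows.
$$T^{\mathrm{SaS}}_{\mathrm{train}}(D_{\mathrm{train}}) \approx PBk \times \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) \qquad (2)$$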
Thus, the overall training complexity of SaS is PBk times larger than that of a base model.

A.1.6 Overall SaS Inference
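The time complexity of SaS inference, $T^{\mathrm{SaS}}_{\mathrm{infer}}(D_{\mathrm{test}})$, is the same as that of the meta-estimator's inference:
$$T^{\mathrm{SaS}}_{\mathrm{infer}}(D_{\mathrm{test}}) = T^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}}) = Pnk \times \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) + \tau^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}}).$$
Again, the second term is negligible when many base models are engaged in SaS ($N_{\mathrm{infer}} \gg 1$) or the meta-estimator is lighter than the base models ($\tau^{\mathrm{meta}}_{\mathrm{infer}}(D_{\mathrm{test}}) / \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) \ll 1$). Thus, $T^{\mathrm{SaS}}_{\mathrm{infer}}(D_{\mathrm{test}})$ can be approximated by the first term alone (i.e., base model inference) as follows.
$$T^{\mathrm{SaS}}_{\mathrm{infer}}(D_{\mathrm{test}}) \approx Pnk \times \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) \qquad (3)$$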
Thus, the overall inference complexity of SaS is Pnk times larger than that of a base model.

A.2 Measurements of Running Times
In this section, we show the observed running times of the SaS algorithm. Please note that the results are neither rigorous nor exhaustive. The purpose here is rather to give readers an order-of-magnitude estimate of the computational time and resources needed to reproduce the SaS experiments.

A.2.1 Measurement of τ
We report the settings of the experiments in Table 5. For computational resources, we trained our base models using Volta (16-GB) GPUs (a single model per GPU). Some large models [i.e., GPT-2 (L) and XLM] were trained on Volta (32-GB) GPUs. On average, the training time was about 30 minutes, that is:
$$\tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) \sim 0.5 \text{ hours}.$$
Note that this estimate is very rough since $\tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}})$ depends on many factors, including the PLM type, the number of training epochs (mostly 8 or 16), and the batch size (1 to 16), and the above value is only the average (or "marginal") of the actual running times.
With the same setting as the training, the inference time was observed to be
$$\tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) \sim 20 \text{ secs},$$
which is much smaller than the training time, as expected.

Table 5: Settings of the experiments.
parameter   value                description
P           7                    Seven types of PLM are used (see Table 1).
B           50                   50 hyperparameter sets per PLM type are generated.
k           5                    5-fold cross-validation is employed.
n           50                   Our best system, Lasso (n = 50), uses 50 base models per PLM type.
D_test      Official test data   About 3k instances.

In our setting, the number of models trained ($N_{\mathrm{train}}$) was as follows.
$$N_{\mathrm{train}} = P \times B \times k = 7 \times 50 \times 5 = 1750$$
Thus, the total training time could be estimated theoretically as follows.
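$$T^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) = N_{\mathrm{train}} \times \tau^{\mathrm{base}}_{\mathrm{train}}(D_{\mathrm{train}}) \sim 1750 \times 0.5 \text{ hours} = 875 \text{ hours}$$
As we observed, with 200 Volta GPUs, it took ∼5 hours to train the whole SaS model. This is of the same order as the theoretical prediction, since 875 GPU-hours spread over 200 GPUs amount to about 4.4 hours.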
The number of models engaged in the ensemble (N infer ) was estimated as follows.
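$$N_{\mathrm{infer}} = P \times n \times k = 7 \times 50 \times 5 = 1750$$
We did not measure the total inference time since, in our implementation, inference was executed at the same time as training. Therefore, we show only the theoretically expected inference time.
$$T^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) = N_{\mathrm{infer}} \times \tau^{\mathrm{base}}_{\mathrm{infer}}(D_{\mathrm{test}}) \sim 1750 \times 20 \text{ secs} = 35{,}000 \text{ secs} \approx 9.7 \text{ hours}$$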
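The order-of-magnitude estimates in this appendix can be reproduced with a short calculation. The sketch below plugs the reported settings (P, B, n, k, the per-model training and inference times, and 200 GPUs) into the approximations of Appendix A.1; the function name is hypothetical, and the wall-clock figure assumes that the models parallelize perfectly across GPUs, which is a simplification.

```python
def sas_cost(P, B, n, k, tau_train_hours, tau_infer_secs, n_gpus):
    """Rough SaS cost estimate under the dominant-term approximations."""
    n_train = P * B * k                      # base models trained
    n_infer = P * n * k                      # base models used at inference
    train_gpu_hours = n_train * tau_train_hours
    return {
        "n_train": n_train,
        "n_infer": n_infer,
        "train_gpu_hours": train_gpu_hours,
        # Assumes ideal parallelism: one model per GPU, no idle time.
        "train_wall_hours": train_gpu_hours / n_gpus,
        "infer_gpu_hours": n_infer * tau_infer_secs / 3600,
    }


cost = sas_cost(P=7, B=50, n=50, k=5,
                tau_train_hours=0.5, tau_infer_secs=20, n_gpus=200)
for name, value in cost.items():
    print(f"{name}: {value}")
```

Changing n or the number of GPUs in the call shows how the inference cost and the training wall-clock time scale with the ensemble size and the available hardware.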

B Base-Model Hyperparameter Generation
In this section, we describe the setup in detail and the results of the base-model hyperparameter generation.

B.1 Setup
We generated hyperparameter sets using Optuna (Akiba et al., 2019), a hyperparameter optimization framework. We used version 1.10. We started each optimization process using Optuna's default seed.
For each PLM type, we generated 50 hyperparameter sets. At each step of the optimization process, we tried 5 hyperparameter sets in parallel. Therefore, in total, 10 steps were needed to try 50 hyperparameter sets. Table 6 and Table 7 show the specific hyperparameter setups. Table 6 shows the hyperparameters and their (i) search ranges, (ii) initial values, and (iii) the Optuna sampling functions used.

PLM                                   model            batch size   pooling
BERT (Devlin et al., 2019)            large-uncased    1            first token
GPT-2 (Radford et al., 2019)          medium / large   16 / 2       last token
RoBERTa (Liu et al., 2019)            large            16           average
Transformer-XL                        wt103            4            average
XLNet                                 large-cased      16           last token
XLM (Lample and Conneau, 2019)        en-2048          1            average

Table 7: PLM-specific fixed hyperparameters. Batch size and embedding pooling function mentioned in Section 4.2 are shown. "First token" takes the first token embedding from a headline, "last token" takes the last token embedding, and "average" takes the average of all token embeddings in a headline.