Hitachi at SemEval-2020 Task 10: Emphasis Distribution Fusion on Fine-Tuned Language Models

This paper describes our system for SemEval-2020 task 10, Emphasis Selection for Written Text in Visual Media. Our strategy is two-fold. First, we propose fine-tuning many pre-trained language models, predicting an emphasis probability distribution over tokens. Then, we propose stacking a trainable distribution fusion system, DistFuse, to fuse the predictions of the fine-tuned models. Experimental results show that DistFuse performs comparably to or better than a naive average ensemble. As a result, we were ranked 2nd amongst 31 teams.


Introduction
This paper presents our strategy for SemEval-2020 task 10, Emphasis Selection for Written Text in Visual Media (Shirani et al., 2020). The task is aimed at emphasis selection, i.e., choosing candidates for emphasis in short written text in visual media (Shirani et al., 2019). Rather than predicting binary emphasis spans, the task asks systems to predict an emphasis distribution over a short text, without any image inputs.

[Figure 1: System overview. Pre-trained language models (BERT, T5, GPT-2, RoBERTa, ...) are fine-tuned with hyperparameter search and cross-validation on inputs such as "Nothing is so contagious as example."]
We tackle the task by combining the rich contextualized embeddings of many fine-tuned pre-trained language models (PLMs). Our strategy, shown in Figure 1, is a simple but effective meta-ensemble method. First, we fine-tune heterogeneous PLMs such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) to predict an emphasis distribution over tokens. The models are trained with the KL-divergence against the gold emphasis distributions over tokens in a text fragment, in order to handle the ambiguity within annotations (Shirani et al., 2019). Second, we propose distribution fusion (DISTFUSE) for ensembles. Unlike an averaging ensemble, DISTFUSE assigns a kind of reliability weight to each distribution, which can improve accuracy.
We evaluate the proposed system in large-scale experiments, which suggest that DISTFUSE performs comparably to or better than an average ensemble. As a result, our system ranked 2nd amongst 31 teams. We also provide insights into the distinct performance of PLMs, training techniques, and hyperparameters in the results section.

Background
The modeling of word emphasis has been tackled in several contexts. Zhang et al. (2016) proposed a model for extracting key phrases from Twitter text. In the context of accent, Nakajima et al. (2014) predicted emphasized phrases from Japanese advertisement text for text-to-speech synthesis. In this shared task, however, the form of emphasis is different. Shirani et al. (2019) provided a corpus with emphasized tokens based on BIO labels. The authors preserved inter-annotator subjectivity and the ambiguity of the input rather than deciding on gold spans. To this end, they recorded how many annotators marked each token as emphasized, producing an emphasis probability over tokens.

Fine-Tuning PLM for Emphasis Selection
Given a tokenized text τ, we fine-tune a PLM to predict emphasis distributions over tokens. In this study, we employ seven state-of-the-art PLMs: BERT, GPT-2, RoBERTa, XLM-RoBERTa, XLNet, XLM (Lample and Conneau, 2019), and T5 (Raffel et al., 2019); therefore, PLM ∈ {BERT, GPT-2, RoBERTa, XLM-RoBERTa, XLNet, XLM, T5}. We obtain the PLM embedding of the i-th word token with layer-wise attention (Kondratyuk and Straka, 2019):

$e^{\mathrm{PLM}}_{\tau,i} = c \sum_{j} \mathrm{softmax}(s)_{j} \, \mathrm{PLM}_{\tau,ij},$

where c and s are learnable parameters, and $\mathrm{PLM}_{\tau,ij}$ is the embedding of the i-th word token in the j-th layer of the PLM for the text τ. To obtain richer features, we add part-of-speech embeddings ($e^{\mathrm{POS}}_{\tau,i}$) and token embeddings from a character-level LSTM ($e^{\mathrm{char}}_{\tau,i}$). Hence, the input representation of the i-th word token is $e_{\tau,i} = e^{\mathrm{PLM}}_{\tau,i} \oplus e^{\mathrm{POS}}_{\tau,i} \oplus e^{\mathrm{char}}_{\tau,i}$, where ⊕ denotes concatenation. Then, we obtain the emphasis distribution for token i by simply applying feed-forward networks (FFNs):

$\hat{d}_{\tau,i} = \frac{\exp\left(w \cdot \mathrm{FFN}(e_{\tau,i}) + b\right)}{\sum_{k=1}^{N_{\tau}} \exp\left(w \cdot \mathrm{FFN}(e_{\tau,k}) + b\right)},$

where w and b are learnable parameters, and $N_{\tau}$ is the number of tokens in the text.
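The following is a minimal PyTorch sketch of such an emphasis head, assuming the hidden states of the last layers are already stacked into a single tensor; the class name EmphasisHead, the tensor shapes, and the 768/32/64/256 dimensions are illustrative assumptions rather than our exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmphasisHead(nn.Module):
        """Sketch: layer-wise attention over PLM layers, concatenation with
        POS/char embeddings, and an FFN + softmax over the tokens of a text."""

        def __init__(self, num_layers=8, plm_dim=768, pos_dim=32, char_dim=64, hidden_dim=256):
            super().__init__()
            # Layer-wise attention parameters (scalar c and per-layer scores s).
            self.c = nn.Parameter(torch.ones(1))
            self.s = nn.Parameter(torch.zeros(num_layers))
            self.ffn = nn.Sequential(
                nn.Linear(plm_dim + pos_dim + char_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, plm_layers, pos_emb, char_emb, token_mask):
            # plm_layers: (num_layers, batch, seq_len, plm_dim) hidden states of the last layers
            # pos_emb:    (batch, seq_len, pos_dim); char_emb: (batch, seq_len, char_dim)
            # token_mask: (batch, seq_len), 1 for real tokens and 0 for padding
            weights = F.softmax(self.s, dim=0).view(-1, 1, 1, 1)
            e_plm = self.c * (weights * plm_layers).sum(dim=0)      # layer-wise attention
            e = torch.cat([e_plm, pos_emb, char_emb], dim=-1)       # concatenated representation
            scores = self.ffn(e).squeeze(-1)                        # (batch, seq_len)
            scores = scores.masked_fill(token_mask == 0, float("-inf"))
            return F.softmax(scores, dim=-1)                        # emphasis distribution over tokens

    def kl_loss(pred_dist, gold_dist, eps=1e-8):
        """KL divergence between the gold emphasis distribution and the prediction."""
        return F.kl_div((pred_dist + eps).log(), gold_dist, reduction="batchmean")

The softmax over tokens makes the head output a proper distribution, so it can be trained directly against the gold emphasis distribution with the KL-divergence loss above.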

Cross-Validation and Model Selection
To generate better models, we fine-tune PLMs with different hyperparameter sets (e.g., learning rates and dropout ratios). Training consists of four steps, as illustrated in Figure 2, and one cross-validated run is sketched below.
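A minimal sketch of one cross-validated fine-tuning run follows; fine_tune and evaluate_match_m are hypothetical helpers standing in for the actual training and evaluation code, and the fold count is an assumption.

    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validate(plm_name, hyperparams, texts, gold_dists, n_splits=5):
        """Fine-tune one PLM with one hyperparameter set and return its mean CV score."""
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        scores = []
        for train_idx, val_idx in kf.split(texts):
            model = fine_tune(plm_name, hyperparams,                     # hypothetical helper
                              [texts[i] for i in train_idx],
                              [gold_dists[i] for i in train_idx])
            scores.append(evaluate_match_m(model,                        # hypothetical helper
                                           [texts[i] for i in val_idx],
                                           [gold_dists[i] for i in val_idx]))
        return float(np.mean(scores))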

Distribution Fusion (DISTFUSE)
To fuse the fine-tuned PLMs, we present DISTFUSE, which utilizes the meta-information of the output emphasis distributions. Let $\{h_1, h_2, \ldots\}$ be a set of combinations of a top-performing hyperparameter set and a PLM (e.g., $h_1 =$ (hyperparameter set 1, BERT)), and let $\hat{d}_{\tau,h_i} \in \mathbb{R}^{N_\tau}$ be the output emphasis distribution for $h_i$. DISTFUSE assigns a kind of reliability weight to each distribution and fuses them all:

$\hat{d}_{\tau} = \sum_{i} \mathrm{softmax}(s^{(\mathrm{fuse})})_{i} \, \hat{d}_{\tau,h_i},$

where $s^{(\mathrm{fuse})}$ is a tunable parameter. If $\mathrm{softmax}(s^{(\mathrm{fuse})})_{i}$ is larger, the network considers the distribution $\hat{d}_{\tau,h_i}$ to be more reliable, and vice versa. We also add the mean-pooled $\hat{d}_{\tau,\mathrm{mean}}$, max-pooled $\hat{d}_{\tau,\mathrm{max}}$, and min-pooled $\hat{d}_{\tau,\mathrm{min}}$ distributions to the input for stable training. Finally, the KL-divergence loss is employed to train DISTFUSE.
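A minimal PyTorch sketch of this fusion step is given below, assuming the output distributions of the fine-tuned models are stacked into one tensor; the module name DistFuse, the shapes, and the final renormalization (to keep the fused output a valid distribution despite the max/min pooled inputs) are our illustrative choices.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DistFuse(nn.Module):
        """Sketch: one softmax-normalized reliability weight per input distribution,
        plus mean/max/min pooled distributions as extra inputs."""

        def __init__(self, num_models):
            super().__init__()
            # One weight per fine-tuned model plus three pooled distributions.
            self.s_fuse = nn.Parameter(torch.zeros(num_models + 3))

        def forward(self, dists):
            # dists: (num_models, batch, seq_len) emphasis distributions from the fine-tuned PLMs
            pooled = torch.stack([dists.mean(0), dists.max(0).values, dists.min(0).values])
            inputs = torch.cat([dists, pooled], dim=0)         # (num_models + 3, batch, seq_len)
            w = F.softmax(self.s_fuse, dim=0).view(-1, 1, 1)
            fused = (w * inputs).sum(dim=0)                    # reliability-weighted fusion
            return fused / fused.sum(dim=-1, keepdim=True)     # renormalize over tokens

The fused distribution can then be trained against the gold distribution with the same KL-divergence loss as in the earlier sketch.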

Experiments
Implementation: Seven PLMs, shown in Table 1, were used. We implemented models with PyTorch (Paszke et al., 2019) and Hugging Face's Transformers library (Wolf et al., 2019). The learnable parameters in the models were split into two groups (Kondratyuk and Straka, 2019), one for the PLM parameters and one for all other non-PLM parameters, assigning a different optimizer to each group. We froze the PLM parameters for the first epoch to improve training stability (Kondratyuk and Straka, 2019). Layer-wise attention was applied over the last eight layers of every PLM, with dropout. We applied linear warmup for learning rate scheduling with Adam (Kingma and Ba, 2015). For DISTFUSE, we employed SGD, decaying the learning rate every step.
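The two-group optimizer setup could look like the sketch below; the "plm." attribute prefix, the default learning rates, and the step counts are assumptions for illustration.

    import torch
    from transformers import get_linear_schedule_with_warmup

    def build_optimizers(model, plm_lr=1e-5, other_lr=1e-4, num_training_steps=1000, warmup_steps=100):
        """Two parameter groups with separate learning rates and a linear-warmup schedule."""
        plm_params = [p for n, p in model.named_parameters() if n.startswith("plm.")]
        other_params = [p for n, p in model.named_parameters() if not n.startswith("plm.")]
        optimizer = torch.optim.Adam([
            {"params": plm_params, "lr": plm_lr},
            {"params": other_params, "lr": other_lr},
        ])
        scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, num_training_steps)
        return optimizer, scheduler

    def freeze_plm(model, freeze=True):
        """Freeze the PLM parameters (e.g., during the first epoch) to stabilize training."""
        for n, p in model.named_parameters():
            if n.startswith("plm."):
                p.requires_grad = not freeze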
Hyperparameter sets including learning rates and dropout ratios were generated by the Optuna framework (Akiba et al., 2019). The optimal learning rates are described in the results section. The rest of the fixed hyperparameters can be found in Table 2. We generated 40 hyperparameter sets for each PLM, and the top 3 sets for each PLM were selected for DISTFUSE.
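A minimal Optuna sketch of this hyperparameter generation is shown below, reusing the hypothetical cross_validate routine from the earlier sketch; the search ranges, the model identifier, and the texts/gold_dists variables are assumptions.

    import optuna

    def objective(trial):
        hyperparams = {
            "plm_lr": trial.suggest_float("plm_lr", 1e-6, 1e-4, log=True),
            "other_lr": trial.suggest_float("other_lr", 1e-5, 1e-3, log=True),
            "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        }
        # texts, gold_dists: training texts and gold emphasis distributions (assumed loaded).
        return cross_validate("bert-base-uncased", hyperparams, texts, gold_dists)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=40)                       # 40 hyperparameter sets per PLM
    top3 = sorted(study.trials, key=lambda t: t.value, reverse=True)[:3]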
We report results for both the development and test sets. When training to predict the test set, we incorporated the development set into the training set. Metric: Systems were evaluated with Match_m (Shirani et al., 2019), which measures the overlap between the m tokens with the highest predicted emphasis probabilities and the m tokens with the highest gold probabilities, averaged over the evaluation set. Table 3 presents the official test results, showing that our system ranked 2nd. The table also shows that our system performed well when m = 1, implying effectiveness in detecting the most emphasized word.
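A Match_m-style score along these lines could be computed as sketched below; this follows our description above and is not the official scorer.

    import numpy as np

    def match_m(pred_probs, gold_probs, m):
        """Overlap between the top-m predicted and top-m gold tokens, averaged over instances."""
        scores = []
        for pred, gold in zip(pred_probs, gold_probs):
            top_pred = set(np.argsort(pred)[-m:])
            top_gold = set(np.argsort(gold)[-m:])
            scores.append(len(top_pred & top_gold) / m)
        return float(np.mean(scores))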

Results
When trained only on the training set without the development set, we obtained a total score of 81.2 (i.e., for each m, 71.1, 80.5, 85.3, and 88.0), showing that our model was effective even when the amount of training data was smaller. Analyses of PLMs: To show how each PLM worked, Table 4 reports the independent performance of each PLM with its top 3 hyperparameter sets. As can be seen, the BERT and XLNet models generally performed well. Interestingly, the table also shows that the individual PLMs were not as strong as our final model (i.e., fusing all PLM types) in most cases, suggesting that using heterogeneous PLMs can improve performance over any single PLM.

[Figure 5: Learning-rate search space for each PLM. The x-axis shows the learning rate for non-PLM parameters and the y-axis the learning rate for PLM parameters; each point is a searched hyperparameter set, darker colors indicate better performance, and the scales differ across graphs.]
Table 6: Optimal learning rates for the PLM and non-PLM parameter groups.
               PLM parameters   non-PLM parameters
BERT           1.28 × 10^-5     1.46 × 10^-4
GPT-2          6.78 × 10^-6     3.91 × 10^-4
RoBERTa        6.95 × 10^-6     4.12 × 10^-4
XLM-RoBERTa    3.00 × 10^-6     9.92 × 10^-5
XLNet          4.29 × 10^-6     9.29 × 10^-5
XLM            6.29 × 10^-6     2.29 × 10^-4
T5             1.99 × 10^-5     2.68 × 10^-4

We visualized the layer-wise weights of the fine-tuned PLMs in Figure 3, which shows that the most heavily weighted layers were generally among the last several layers. However, there was high variance; for example, XLM and XLNet placed less weight on the last layers, and T5 showed a more uneven, up-and-down weighting pattern across layers. Analyses of DISTFUSE: Table 5 compares the performance of DISTFUSE and an average ensemble. As can be seen, the proposed DISTFUSE consistently showed comparable or better performance, suggesting that DISTFUSE is promising for boosting performance. Figure 4 illustrates the weight parameter $s^{(\mathrm{fuse})}$ of DISTFUSE, interestingly showing that the min- and max-pooled inputs were the most important elements. We estimate that this is because the max- and min-pooled elements capture the features of the most and least emphasized tokens. We can also see that strong PLMs such as XLM-RoBERTa and XLNet were weighted more heavily than XLM and GPT-2. We estimate that this reliability-weight assignment is what made the predictions of DISTFUSE more robust. Meta-Insights: Our in-depth analyses showed that tuning the learning rates of each PLM is important. Figure 5 visualizes the learning-rate space for the two parameter groups. The figure shows that there are clearly optimal regions for both the PLM and non-PLM learning rates. For example, the optimal rates of BERT were mostly found in the upper left.
We show the optimal learning rates in Table 6. XLM-RoBERTa and XLNet had relatively smaller learning rates, while BERT and T5 had larger ones. The table also shows that the learning rates of the non-PLM parameters were larger than those of the PLM parameters. This insight suggests that tuning the two groups independently can be effective.
Case Study: Table 7 shows an example of the output emphasis rankings for Life is a succession of lessons which must be lived to be understood in a validation fold. Most of the PLMs correctly predicted that the most emphasized words are Life and lessons, showing the promising capability of PLMs. The table also shows that the PLMs produced different outputs; for example, while succession was strongly emphasized by GPT-2, the other PLMs did not emphasize it. We can also see that some of the models captured less emphasized tokens such as lived and understood.

Conclusion
In this paper, we proposed a model for the task of emphasis selection. We employed seven pre-trained language models and fused them with the distribution fusion (DISTFUSE) system. Experimental results suggested that DISTFUSE is promising for boosting performance. We estimate that the effectiveness of DISTFUSE could be further validated through additional analyses (Dodge et al., 2019), which we leave as future work. As additional future work, we will examine more effective ways of computing distributions.