IDS at SemEval-2020 Task 10: Does Pre-trained Language Model Know What to Emphasize?

We propose a novel method that determines which words in the written text of visual media deserve to be emphasized, relying only on the information from the self-attention distributions of pre-trained language models (PLMs). With extensive experiments and analyses, we show that 1) our zero-shot approach is superior to a reasonable baseline that adopts TF-IDF and that 2) there exist several attention heads in PLMs specialized for emphasis selection, confirming that PLMs are capable of recognizing important words in sentences.


Introduction
In visual communication such as social media posts (posters, flyers, and ads) and motivational messages, text emphasis is crucial: it facilitates the comprehension of written text and helps convey the author's intent. An automatic system that recommends which parts of the text in visual media to emphasize would therefore bring significant advantages; for example, it could accelerate the production of posters and advertisement videos. In this paper, we attempt to devise such a system solely with the aid of pre-trained language models (PLMs), without any laborious task-specific training.
Recently, there has been a substantial amount of work in the literature on figuring out what knowledge Transformer-based (Vaswani et al., 2017) PLMs such as BERT (Devlin et al., 2019) contain and why they perform surprisingly well on various downstream tasks (Goldberg, 2019; Kovaleva et al., 2019; Rogers et al., 2020). Among these, a group of studies has focused on analyzing PLMs' self-attention distributions to find evidence supporting the existence of linguistic knowledge within the pre-trained weights of PLMs. Specifically, Clark et al. (2019) investigated BERT's individual attention heads to probe its ability to parse dependency trees, while Kim et al. (2020) proposed an unsupervised constituency parsing method that operates on top of these distributions.
Following the philosophy shared by the work mentioned above, we propose a zero-shot emphasis selection method, assuming that during pre-training, some attention heads in PLMs learn to recognize which words are more important than others. We test our method on a carefully designed dataset (Shirani et al., 2020), and it achieves a ranking score of 0.690 on the validation set, outperforming an intuitive baseline that adopts the term frequency-inverse document frequency (TF-IDF) strategy (Jones, 1972). Furthermore, our method operates in a fully zero-shot manner, never leveraging the gold-standard annotations provided by the dataset, implying its universal applicability in low/zero-resource regimes.

Task
This paper aims to present a tractable solution to the SemEval 2020 shared task 10 (Shirani et al., 2020). The dataset provided by this task consists of short English sentences obtained from Adobe Spark. It covers a variety of subjects featured in flyers, posters, and advertisements or motivational memes on social media; for example, "In honor of the brave" (Shirani et al., 2019).

[Table 1: Data structure of the example sentence "In honor of the brave". A1-A9 are nine annotators. I and O correspond to whether to emphasize the word or not. The emphasis frequency, $e\_freq$, is the average of the nine labels of A1-A9.]

[Figure 1: Sample attention map of the example sentence.]

In detail, as shown in Table 1, each word of a sentence in the dataset is provided with nine binary labels (I/O tags) that correspond to the decisions of nine annotators, each indicating whether to emphasize the target word or not. We define the emphasis frequency ($e\_freq$) of a word in a sentence as the average of these nine labels, treating 'I' as 1 and 'O' as 0. Our goal is to construct a model that predicts the correct ranking of words according to each word's gold emphasis frequency.
For evaluation, we use the $\mathit{Match}_m$ ($m \in \{1, 2, 3, 4\}$) score, which is calculated as:

$$\mathit{Match}_m = \frac{1}{|D|} \sum_{x \in D} \frac{|S_m^{(x)} \cap \hat{S}_m^{(x)}|}{m},$$

where $S_m^{(x)}$ is the set of top-$m$ high emphasis frequency words in an input sentence $x$ of a dataset $D$ (e.g., $S_2^{(\text{"In honor of the brave"})}$), $\hat{S}_m^{(x)}$ is the set of top-$m$ high emphasis frequency words based on our model's prediction, and $|\cdot|$ denotes the number of elements in a set. Moreover, we introduce the Ranking Score as an aggregated measure that averages all possible $\mathit{Match}_m$ scores:

$$\text{Ranking Score} = \frac{1}{4} \sum_{m=1}^{4} \mathit{Match}_m.$$
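To make the metric concrete, below is a minimal Python sketch of $\mathit{Match}_m$ and the Ranking Score (not the official scorer). It assumes each sentence's $e\_freq$ values are given as a list of floats aligned with word positions, and it breaks ties arbitrarily.

```python
from typing import List, Sequence

def top_m_indices(e_freq: Sequence[float], m: int) -> set:
    """Positions of the m words with the highest emphasis frequency."""
    order = sorted(range(len(e_freq)), key=lambda i: e_freq[i], reverse=True)
    return set(order[:m])

def match_m(gold: List[Sequence[float]], pred: List[Sequence[float]], m: int) -> float:
    """Match_m: mean of |S_m ∩ S^_m| / m over all sentences in the dataset."""
    return sum(
        len(top_m_indices(g, m) & top_m_indices(p, m)) / m
        for g, p in zip(gold, pred)
    ) / len(gold)

def ranking_score(gold: List[Sequence[float]], pred: List[Sequence[float]]) -> float:
    """Ranking Score: average of Match_1 through Match_4."""
    return sum(match_m(gold, pred, m) for m in range(1, 5)) / 4
```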

Method
In this section, we propose three ways to induce the emphasis frequency ($e\_freq$) of each word in a sentence using a PLM's attention maps. The intuition behind our approach is that the more a word draws attention from other words, the more suitable it is as a target to be emphasized. In other words, the words that contribute the most to constructing the intermediate representations of other words for the next layer of a PLM should have high $e\_freq$ values. As we only resort to the inherent knowledge of PLMs rather than learning a separate model from scratch, we do not need any further training based on supervision from gold-standard annotations to implement our approach. This characteristic is attractive from the perspective that gold-standard annotations are expensive to collect and even somewhat subjective.
To reduce ambiguity, we first define our terminology. We denote a sentence as a set of words, $s = \{w_m \mid m = 1, \ldots, n\}$, where $n$ stands for the number of words in the sentence. When a sentence $s$ is fed to a PLM, an attention map, which is a set of attention distributions of a particular self-attention head, can be extracted. We define $G$ as the set of attention maps extracted from a PLM, i.e., $G = \{g^{(i,j)} \in \mathbb{R}^{(n+2) \times (n+2)} \mid i = 1, \ldots, l,\ j = 1, \ldots, a\}$, where $g^{(i,j)}$ is the attention map of the $j$th attention head on the $i$th layer, and $l$ and $a$ are the numbers of layers and attention heads per layer, respectively. The reason why we add 2 to $n$ is to account for the two special tokens, [CLS] and [SEP]. Figure 1 shows a sample attention map of the example sentence, where each row represents the attention distribution of the corresponding word; e.g., the 3rd row is the attention distribution of the word 'honor' over the other words.

Some pre-processing is required to obtain attention maps that can be utilized by our methods. First, we add the special tokens [CLS] and [SEP] to an input sentence, e.g., "[CLS] In honor of the brave [SEP]", as described in Devlin et al. (2019). Then, we extract attention maps after feeding the input sentence to the PLM. Since most PLMs tokenize a sentence into subword units, we convert token-level attention maps into word-level attention maps by aggregating each group of attention weights over the subword tokens that belong to the same word, following Clark et al. (2019) and Kim et al. (2020).
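As a concrete illustration, the following sketch extracts word-level attention maps with the HuggingFace transformers library. This is a minimal reconstruction, not the authors' released code; it follows the Clark et al. (2019) convention of summing attention going *to* a word's subwords (so each row stays a probability distribution) and averaging attention coming *from* them, which is one reasonable instantiation of the aggregation described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def word_level_attention_maps(sentence: str) -> torch.Tensor:
    """Return G as a tensor of shape (l, a, n+2, n+2) for a whitespace-split sentence."""
    words = sentence.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        # Tuple of l tensors of shape (1, a, T, T); T = number of subword tokens.
        attn = torch.stack(model(**enc).attentions).squeeze(1)  # (l, a, T, T)

    # Group subword positions by word; [CLS]/[SEP] (word id None) get singleton groups.
    word_ids = enc.word_ids()
    groups = []
    for pos, wid in enumerate(word_ids):
        if wid is not None and pos > 0 and word_ids[pos - 1] == wid:
            groups[-1].append(pos)
        else:
            groups.append([pos])

    # Attention *to* a word: sum its subword columns (rows stay distributions);
    # attention *from* a word: average its subword rows (Clark et al., 2019).
    to_words = torch.stack([attn[..., g].sum(dim=-1) for g in groups], dim=-1)
    return torch.stack([to_words[..., g, :].mean(dim=-2) for g in groups], dim=-2)
```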
For each attention head, we consider three options to derive $e\_freq(word_t)$: Words2Target, CLS2Target, and SEP2Target. Each option is depicted in Figure 2.

Words2Target
Given a particular attention head (the $j$th attention head on the $i$th layer of a PLM), the emphasis frequency of a word can be calculated as the average of the other words' attention weights on the target word:

$$e\_freq(word_t) = \frac{1}{n+1} \sum_{k \ne t} g^{(i,j)}_{k,t}.$$

This equation captures how influential the $t$th word, $word_t$, is when constructing the hidden representations of the other words. It can also be viewed as the average of the values in the $t$th column of the attention map, as shown in Figure 2-(a).
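Reusing the word-level maps from the earlier sketch, Words2Target reduces to a masked column mean; a minimal version:

```python
def words2target(g: torch.Tensor) -> torch.Tensor:
    """Words2Target on a word-level map g of shape (n+2, n+2): average the
    attention that all *other* positions pay to column t, then drop the
    scores of the [CLS]/[SEP] positions."""
    size = g.size(0)
    mask = ~torch.eye(size, dtype=torch.bool)    # exclude self-attention g[t, t]
    scores = (g * mask).sum(dim=0) / (size - 1)  # column means over the other rows
    return scores[1:-1]                          # e_freq for word_1 .. word_n
```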

CLS2Target and SEP2Target
Many PLMs, including BERT, use special tokens ([CLS], [SEP]) to encode a sentence representation or the relation between two input sentences. The attention weight that a special token assigns to $word_t$ indicates how much $word_t$ contributes to the sentence representation. Thus, when a PLM is given an input sequence ([CLS], $w_1$, $w_2$, \ldots, $w_n$, [SEP]), we induce $e\_freq(word_t)$ from the attention weight on $word_t$ in the attention distribution of each special token, as expressed in Figure 2-(b) and (c).
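The special-token variants are simpler still: read off the [CLS] or [SEP] row of the same word-level map (a sketch under the same assumptions as above):

```python
def cls2target(g: torch.Tensor) -> torch.Tensor:
    """CLS2Target: attention the [CLS] row (position 0) pays to each word."""
    return g[0, 1:-1]

def sep2target(g: torch.Tensor) -> torch.Tensor:
    """SEP2Target: attention the [SEP] row (last position) pays to each word."""
    return g[-1, 1:-1]
```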

Best Configuration Selection
For a single PLM, there are $l \times a \times 3$ possible configurations, because $e\_freq(word_t)$ can be computed in three ways, $M \in \{\text{Words2Target}, \text{CLS2Target}, \text{SEP2Target}\}$, for every $g^{(i,j)} \in G$. For instance, in the case of BERT-base, which consists of 12 layers with 12 attention heads per layer, there are $12 \times 12 \times 3 = 432$ possible configurations. We compute $\mathit{Match}_m$ and the Ranking Score for every $(g^{(i,j)}, M)$ pair and select the best configuration for the PLM based on the Ranking Score.
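Putting the pieces together, the configuration search is a plain grid search; a sketch reusing the helpers defined above, where the validation sentences and their gold $e\_freq$ lists are assumed inputs:

```python
def best_configuration(sentences, gold):
    """Try every (layer, head, method) triple and keep the configuration
    with the highest Ranking Score on the validation data."""
    methods = {"Words2Target": words2target,
               "CLS2Target": cls2target,
               "SEP2Target": sep2target}
    # Compute word-level maps once per sentence: each of shape (l, a, n+2, n+2).
    maps = [word_level_attention_maps(s) for s in sentences]
    n_layers, n_heads = maps[0].shape[:2]

    best_cfg, best_score = None, float("-inf")
    for i in range(n_layers):
        for j in range(n_heads):
            for name, fn in methods.items():
                pred = [fn(amap[i, j]).tolist() for amap in maps]
                score = ranking_score(gold, pred)
                if score > best_score:
                    best_cfg, best_score = (i, j, name), score
    return best_cfg, best_score
```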

Baseline
To evaluate the performance of our method more precisely, we propose a reasonable baseline using term frequency-inverse document frequency (TF-IDF) as follows:

$$e\_freq(word_t) = \frac{f_{word_t, d_{dev}}}{len(d_{dev})} \times \log \frac{|D_{train}|}{f_{word_t, D_{train}}},$$

where $D_{train}$ and $D_{dev}$ are the sets of sentences of the training set and validation set, and $d_{train}$ and $d_{dev}$ are sentences sampled from these datasets, respectively. $f_{word_t, d_{dev}}$ is the number of occurrences of $word_t$ in the sentence $d_{dev}$, $f_{word_t, D_{train}}$ is the number of training sentences that contain $word_t$, and $len(d_{dev})$ is the number of words in $d_{dev}$. The more often $word_t$ occurs in the sentence and the less often it appears in the whole training set, the larger its TF-IDF score will be; this is why TF-IDF is considered a statistical measure of a word's specificity in a particular document. As additional points of reference, the word-counting method assigns $e\_freq(word_t) = \frac{1}{f_{word_t, D_{train}}}$, which leads to rare words receiving larger $e\_freq$ values, and the random baseline assigns a random $e\_freq$ value to the target word. In our experiments, the TF-IDF model performs better than the other two methods, making it a suitable choice as a reasonable baseline.
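A minimal sketch of this baseline, assuming sentences are pre-tokenized, lowercased word lists; the +1 smoothing in the IDF term is our addition to avoid division by zero:

```python
import math
from collections import Counter

def tfidf_e_freq(sentence, train_sentences):
    """TF-IDF baseline: e_freq(word_t) = tf(word_t, sentence) * idf(word_t)."""
    # Document frequency: number of training sentences containing each word.
    df = Counter(w for s in train_sentences for w in set(s))
    tf = Counter(sentence)
    n_train = len(train_sentences)
    return [
        (tf[w] / len(sentence)) * math.log(n_train / (1 + df[w]))
        for w in sentence
    ]
```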

Results
In Table 2, we report the results of applying our method to various PLMs on the validation and test sets. Note that with a few exceptions, our method combined with PLMs performs better than the baselines, including the one based on TF-IDF (0.5145 ranking score). Among the PLMs, the BERT-large-uncased and DistilBERT-base-uncased models record the best performance.
Here we mention several takeaways from our results. First, particular PLMs (GPT-2, XLNet) report considerably lower ranking scores than the other PLMs and the TF-IDF baseline. Since GPT-2 adopts a Transformer decoder as its architecture, its attention distribution is skewed toward the first word of a sentence (Vig, 2019). This results in a sub-optimal solution where the first word of each of the 390 sentences in the validation set becomes the top-1 high $e\_freq$ word. On the other hand, the XLNet models tend to focus more on punctuation marks such as the period (.) and comma (,). Specifically, the XLNet-base model predicts that a period should be one of the top-4 high $e\_freq$ words for 289 sentences in the validation set, even though the prediction is correct in only 48 cases.

[Table 3: Top-4 emphasis frequency words in two sentences from gold and our model. Gold: from gold $e\_freq$ values, Ours: DistilBERT-base-uncased's best configuration. Words are sorted based on $e\_freq$ values and ranks are in parentheses. M1-M4, R: sentence-wise $\mathit{Match}_1$-$\mathit{Match}_4$ scores and ranking score for the corresponding sentence.]
Second, although our method records relatively lower scores than the supervised baseline (the DL-BiLSTM+ELMo model) proposed in Shirani et al. (2019), we find that it generates quite meaningful emphasis selections. In Table 3, the DistilBERT-base-uncased model selects nest, a, web, and friendship to emphasize in S2, "The bird a nest, the spider a web, man friendship."; although this selection does not exactly match the gold top-4 words, our model's result also seems valuable.
Third, the DistilBERT models achieve high performance despite their small number of parameters. We conjecture that the distillation techniques applied to build DistilBERT make its attention heads function as an ensemble of the attention heads of the parent model. Besides, uncased models show better performance than cased models; we hypothesize that a word's capitalization carries little useful signal when selecting proper words to be emphasized.
Lastly, ensembling several PLMs is clearly beneficial. We ensemble the top-5 models by averaging their $e\_freq$ values and achieve a 0.6898 ranking score, which is significantly higher than those of the individual PLMs.
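A sketch of this ensembling step, assuming each model contributes a per-word $e\_freq$ list for the same sentence:

```python
def ensemble_e_freq(per_model_scores):
    """Average per-word e_freq predictions across several models."""
    n_models = len(per_model_scores)
    return [sum(ws) / n_models for ws in zip(*per_model_scores)]
```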
[Figure 3: Layer-wise ranking scores of the (a) DistilBERT-base-uncased and (b) BERT-large-uncased models. Each dot represents a configuration that records a ranking score above that of the random baseline. Dotted lines correspond to the ranking score of the TF-IDF baseline.]
For further analysis, we investigate how many attention heads are capable of selecting proper words to emphasize. For the top-2 models (DistilBERT-base-uncased and BERT-large-uncased), we probe the layer-wise ranking scores of individual attention heads in Figure 3. We find that there exist attention heads specialized for word emphasis, i.e., ones recording high ranking scores. In both cases, there is a gap between the topmost attention heads and the others; for instance, in the BERT-large-uncased model, the configuration with the second-best ranking score reports 0.5975, which is 0.0445 lower than that of the top-1 configuration.

Conclusion
We have proposed a zero-shot emphasis selection method, focusing on investigating whether pre-trained language models contain enough knowledge to select proper words to be emphasized. We have found that some PLMs deliver competitive performance, confirming that certain specialized attention heads in PLMs have the ability to detect words that deserve emphasis.