Label-Enhanced Hierarchical Contextualized Representation for Sequential Metaphor Identification

Recent metaphor identification approaches mainly consider the contextual text features within a sentence or introduce external linguistic features into the model, but they usually ignore the extra information that the data can provide, such as contextual metaphor information and broader discourse information. In this paper, we propose a model augmented with hierarchical contextualized representation to extract more information at both the sentence level and the discourse level. At the sentence level, we leverage the metaphor information of the words other than the target word in the sentence to strengthen the reasoning ability of our model via a novel label-enhanced contextualized representation. At the discourse level, a position-aware global memory network is adopted to learn long-range dependencies among occurrences of the same words within a discourse. Finally, our model combines the representations obtained from these two parts. The experimental results on two tasks of the VUA dataset show that our model outperforms the state-of-the-art methods that likewise use no external knowledge beyond what the pre-trained language model contains.


Introduction
Metaphor is a type of figurative language whose essence is understanding and experiencing one kind of thing in terms of another (Lakoff and Johnson, 1980). As a common language expression, we often use metaphors to express our thoughts vividly and concisely in daily communication. For example, in the sentence It is one of the keys for success of a commercial product, keys is used to help understand the importance of It. However, this characteristic of metaphor makes it challenging to identify metaphors in texts. Yet identifying metaphors is meaningful and can help us understand the meaning of texts, benefiting many downstream applications such as machine translation (Koglin, 2015) and opinion mining (Shutova et al., 2013).
Recent metaphor research (Gao et al., 2018; Mao et al., 2019) and the ACL 2020 Metaphor Shared Task regard metaphor identification as a sequence labeling task. Although many previous works have explored ways to enhance the contextualized representation within a sentence (Gao et al., 2018; Mao et al., 2019) or to introduce external knowledge (Rohanian et al., 2020; Wan and Xing, 2020), most of them do not make full use of the information in the dataset, from which the metaphor identification process may benefit.
Firstly, when considering the metaphoricity of the target word, the metaphor information of other words in the sentence can also be helpful. E.g., in the sentence He find himself in the position of the gambler who gambled all and lost, gambler and gambled are metaphors. The word gambled is the action of the gambler, and it is reasonable for a gambler to gamble. Thus, the model might prefer to classify gambled as literal. However, if we know that gambler is a metaphoric word, then it is obvious that gambled is also a metaphoric word that refers to the risky thing he did. Based on this observation, we propose a novel label-enhanced contextualized representation method to introduce contextual metaphor information. It first embeds the label of each word (i.e. metaphoric or literal) in the same space as the output of the encoder, and then takes both the output of the encoder and the label embedding as the input of a transformer (Vaswani et al., 2017). We believe it could enhance the reasoning ability of the model by attending to the metaphor information of other words in the sentence. Besides, marking the metaphoric words in context could also help the model understand the context of the target word better, especially in complicated sentences, because metaphorical words are not used in their literal meaning, which increases the difficulty of understanding the context. To the best of our knowledge, we are the first to propose this method.

Figure 1: An example that shows the two occurrences of the word gambled within a discourse
Secondly, some existing benchmark metaphor datasets, such as VUAMC, contain sentences from long articles, and the contextual information in the articles is very useful for metaphor identification. Some previous works used paragraph embeddings (Mu et al., 2019) or neighbouring sentence representations (Dankers et al., 2020). Based on this, we use a discourse-level attention architecture that can capture both global and local features in the whole discourse for the target word. First, we follow the work of Dankers et al. (2020) to extract local information. Then, we propose an improved variant of Global Attention (Zhang et al., 2018), called the position-aware global memory network, to represent the global information of the target word. It is based on the observation that a metaphor brings another domain/frame into the discourse, so metaphors mapping to the same domain/frame are likely to reoccur throughout the discourse, especially among the same words. Specifically, our model uses an attention mechanism between the target word and its other occurrences in the discourse. Figure 1 shows the two occurrences of the word gambled in two sentences within a discourse.
Based on the above sentence-level and discourse-level methods, we propose a novel hierarchical contextualized representation model for metaphor identification, as shown in Figure 2. To verify the effectiveness of our model, we conduct experiments on the ALL POS and Verbs tasks of the VU Amsterdam Metaphor Corpus (VUA) (Steen, 2010). Our model outperforms several baseline models by 1.1% (VUA ALL POS) and 1.0% (VUA Verbs) in F1 score. In addition, the results of our model surpass DeepMet (Su et al., 2020), the state-of-the-art model in metaphor identification, under the same experimental setup.
Our contributions in this paper can be summarized as follows.
• We propose a novel label-enhanced contextualized representation method to enhance the model's ability to reason about contextual metaphoric relationships and better understand the meaning of context.
• At the discourse level, we use an improved position-aware global memory network to introduce the long-range discourse information.
• Experimental results on the two tasks of the VUA dataset show that our model outperforms the state-of-the-art methods that also do not use external knowledge.
Related Work

Metaphor Identification
Most early metaphor identification works employed machine learning approaches using linguistic features (Turney et al., 2011; Tsvetkov et al., 2013; Mohler et al., 2013; Klebanov et al., 2016; Bulat et al., 2017a). In recent years, neural metaphor identification has become highly popular for its end-to-end fashion and better performance. Wu et al. (2018) combined CNN and LSTM to obtain local and long-range information and achieved the best performance in the NAACL 2018 VUA Shared Task (Leong et al., 2018). Gao et al. (2018) applied the combined embeddings of GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) as the input of a Bi-LSTM, which introduced contextualized word embeddings. Based on the model of Gao et al. (2018), Mao et al. (2019) proposed RNN_HG and RNN_MHCA, inspired by the MIP (Group, 2007) and SPV (Wilks, 1978) theories respectively, and gained certain improvements. Multi-task learning (Dankers et al., 2019) and linguistic features (Rohanian et al., 2020; Wan and Xing, 2020) have also been explored and applied to deep learning models. Su et al. (2020) achieved the best performance in the ACL 2020 Metaphor Shared Task (Leong et al., 2020) by taking the global text context, local text context, query word, general POS, and fine-grained POS as the input of a RoBERTa model (Liu et al., 2019).
There are also some works on relation-level metaphor identification. The early works employed machine learning models using linguistic features as well, including conceptual semantic features (Tsvetkov et al., 2014), visual features, and attribute-based semantics (Bulat et al., 2017b). The recent works mainly used deep learning models. Rei et al. (2017) proposed a supervised similarity network for relation-level metaphor identification. Zayed et al. (2020) introduced a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation, which achieved state-of-the-art results.
In this paper, we consider the token-level metaphor identification task for long discourse. Different from these previous works, we start from the known information that the data can provide, and use the label-enhanced contextualized representation to strengthen the model's reasoning ability by introducing contextual metaphor information.

Discourse-level Representation
Since the datasets contain discourse information, some researchers have enhanced word contextualized representations by introducing discourse features. Jang et al. (2015) used hand-crafted discourse-level features such as topical information and semantic relatedness. Mu et al. (2019) obtained discourse contextual information by embedding the surrounding paragraph. Dankers et al. (2020) applied general attention and hierarchical attention to both the target sentence and its neighbouring sentences to get a discourse representation. However, Mu et al. (2019) and Dankers et al. (2020) only considered the context close to the target word and processed it in the same way as a sentence, which is not suitable for long contexts. To adapt to the longer context of a discourse, we consider the occurrences of words, which avoids processing texts that are too long and captures the consistency in the use of metaphors in discourse. The work of Jang et al. (2017) had an idea similar to ours, paying attention to similar words that appear globally in the discourse. However, their method must predefine frames and know which frame the target word belongs to. This limitation in scalability makes their method inapplicable to general metaphor datasets such as VUA.
In the field of Named Entity Recognition research, where document-level tasks are more common, there are some document representation methods that we can use for reference. Zhang et al. (2018) proposed Global Attention that establishes the relationship among the occurrences of the word within a document. Luo et al. (2020) adopted a key-value memory network to record the history hidden states. Based on their works, we propose an improved position-aware global memory network.

Baseline Model
Given a sentence with a sequence of words {x_1, x_2, ..., x_n}, our goal is to predict its metaphor labels {y_1, y_2, ..., y_n} as accurately as possible. Since many previous works (Dankers et al., 2020; Neidlein et al., 2020) have proven the effectiveness of pre-trained language models in metaphor identification, we use BERT (Devlin et al., 2019) as our baseline model. Specifically, we follow the work of Dankers et al. (2020): a word is considered metaphoric if any of its sub-word units, tokenized by the Byte Pair Encoding (BPE) algorithm used in BERT, is predicted as metaphoric. Thus, we can get the output hidden states of BERT:

{h_1, h_2, ..., h_n} = BERT({x_1, x_2, ..., x_n})

where h_i is the hidden state of word x_i.
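The any-subword-metaphoric aggregation rule can be sketched as follows (the function name and the word_ids convention, as used by common BERT tokenizers, are our assumptions):

```python
def merge_subword_predictions(word_ids, subword_preds):
    """Aggregate BPE sub-word predictions to word-level labels.

    A word is labelled metaphoric (1) if ANY of its sub-word units
    is predicted metaphoric, following Dankers et al. (2020).

    word_ids:      word index for each sub-word token (None for specials)
    subword_preds: 0/1 prediction for each sub-word token
    """
    word_labels = {}
    for wid, pred in zip(word_ids, subword_preds):
        if wid is None:  # skip [CLS]/[SEP]/padding positions
            continue
        word_labels[wid] = max(word_labels.get(wid, 0), pred)
    return [word_labels[i] for i in sorted(word_labels)]
```

For example, if a word is split into two sub-words and only the second is predicted metaphoric, the whole word is still labelled metaphoric.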

Discourse-level Representation
Sentences in some metaphor datasets, such as VUA, come from long texts. The semantic meaning of the sentences needs to be accurately obtained by considering the context at the discourse level. Therefore, we use hierarchical attention to extract the neighbouring sentence representation and position-aware discourse-level attention to capture long-range dependency.

Neighbouring sentence representation Here we follow the work of Dankers et al. (2020). We use a context window of 2k + 1 sentences, which comprises k preceding sentences, the target sentence, and k succeeding sentences. They are fed into a hierarchical attention architecture (Yang et al., 2016), where the first encoder is BERT and the second encoder is a transformer (Vaswani et al., 2017). At last, we concatenate the neighbouring representation N obtained by the hierarchical attention with the output hidden states h_i of BERT.

Position-aware global memory network To utilize the information of the whole discourse, we borrow the strategy of Global Attention (Zhang et al., 2018) to capture long-range dependencies among the same words within a discourse. The main idea is to employ a global attention mechanism between the target word and its other occurrences within the discourse. Considering the time cost, we adopt the method of Luo et al. (2020), which records the history hidden states of other occurrences for each word instead of recalculating them. Thus, we call it a global memory network.
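As a minimal sketch (the function name is ours), the 2k + 1 context window around a target sentence can be built as follows, truncating at discourse boundaries:

```python
def context_window(sentences, target_idx, k=2):
    """Return the 2k+1 sentence window around the target sentence:
    k preceding sentences, the target itself, and k succeeding
    sentences. The window is truncated at discourse boundaries."""
    start = max(0, target_idx - k)
    end = min(len(sentences), target_idx + k + 1)
    return sentences[start:end]
```

Each window would then be passed through the hierarchical attention architecture (BERT over tokens, then a transformer over sentence representations).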
Specifically, we record the hidden states h_i produced by the baseline BERT model for each word x_i in the sentences. Then, we put the hidden states of x_i's occurrences in the discourse into one group. The group containing word x_i can be represented as follows:

G = {h^m_1, h^m_2, ..., h^m_T}

where h^m_j is the hidden state of the j-th occurrence of x_i. For each token x_i and its output hidden state h_i in the given sentence, we can get the corresponding group G.
Although there is no explicit sequence relation inside G, the position of words in G still affects their contribution to the target word. For example, the words close to the target word may influence it more. Therefore, based on the global memory network, we add position embeddings to G. Assuming that x_i is located at the t-th place in G, we remove the record of x_i from G and then get a matrix:

M_t = [h^m_1; ...; h^m_{t-1}; h^m_{t+1}; ...; h^m_T]

We use h^p_j = h^m_j + pos_j, where pos_j is the position embedding, so M^P_t can be represented as:

M^P_t = [h^p_1; ...; h^p_{t-1}; h^p_{t+1}; ...; h^p_T]

A dot-product attention is applied between h_i and each h^p_j ∈ M^P_t to get the response of the global memory network:

r_i = Σ_j softmax_j(h_i · h^p_j) h^p_j

Finally, h_i is used to update G by replacing h^m_t. Then, we can get the final representation d_i by fusing h_i, N, and r_i:

d_i = [h_i; N; λ r_i]

where N is the neighbouring sentence representation and λ is a weighting parameter.
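The position-aware attention over the memory can be sketched in plain Python as follows. This is our reading of the mechanism under stated assumptions: the target's raw hidden state serves as the query, learned projections are omitted, and the target's own record is assumed to be already removed from the memory.

```python
import math

def global_memory_response(h_i, memory, pos_emb):
    """Position-aware global memory attention (a sketch).

    h_i:     hidden state of the target occurrence (list of floats)
    memory:  hidden states of the word's OTHER occurrences in the
             discourse (the target's own record already removed)
    pos_emb: one position embedding per memory slot
    Returns the attention-weighted response r_i.
    """
    # h^p_j = h^m_j + pos_j : inject position information
    keys = [[m + p for m, p in zip(h_m, pos)]
            for h_m, pos in zip(memory, pos_emb)]
    # dot-product attention scores of the target against each occurrence
    scores = [sum(q * k for q, k in zip(h_i, key)) for key in keys]
    mx = max(scores)
    exp = [math.exp(s - mx) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]
    # response: weighted sum of the position-aware memory states
    dim = len(h_i)
    return [sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(dim)]
```

With a single memory entry the response is simply that position-aware state; with several entries, occurrences whose states align better with the target receive higher weight.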

Sentence-level Representation
In this section, we propose a novel label-enhanced contextualized representation that explicitly introduces contextual metaphor information, which is useful for understanding the context. Specifically, label embedding is adopted to represent each label, and early prediction is used to provide reference metaphor labels for the label embedding module.

Label embedding To fuse the contextualized representation of words with label information, we use label embedding to map labels into the same space as the contextualized representation. That is, every type of label (i.e. metaphoric or literal) corresponds to a vector via the label embedding. Therefore, we can obtain the label embedding l_i of the word x_i according to its label y_i. Then we take the sum of d_i and l_i as the input of a transformer encoder layer. Considering the particularity of label embedding, we modify the Q, K, and V in the standard transformer architecture (Vaswani et al., 2017):

Q_i = d_i + l_pad
K_i = V_i = d_i + l_i

where l_pad is a padding embedding with the same dimension as l_i. This is because l_i in the training steps comes from the golden label y_i, which would lead to label leakage if Q contained the label information of the word itself. That is, the output of the target word would contain its own golden label information. Similarly, K_i and V_i introduce the label information of word x_i when we calculate the attention of q_i to K and V. So we add a mask matrix to the self-attention mechanism:

AttentionMask_{jk} = 0 if j = k, otherwise 1

where the diagonal elements are all 0. This means each word ignores itself when calculating attention. Then we take Q, K, V, and AttentionMask as the input of the transformer encoder to obtain the final prediction ŷ_i. We use ŷ_i as the final prediction in the testing stage.

Early prediction The strategy of introducing contextual labels adopted above uses contextual golden labels rather than the labels predicted by the model in the training phase, which is similar to the teacher forcing strategy widely used in text generation tasks.
However, it will be invalid in the testing stage since the golden labels of the test set cannot be used as known information. To address this deficiency, we add an early prediction module:

ŷ^ep_i = softmax(W_ep d_i + b_ep)

In this way, the model can predict the label of x_i in advance. In the testing stage, the predicted metaphor label ŷ^ep_i is provided to the label embedding phase as a substitute for the golden label.
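A sketch of how the inputs to the label-enhanced transformer layer could be assembled (the function name and data layout are our assumptions; a real implementation would feed Q, K, V, and the mask into multi-head attention):

```python
def build_label_inputs(d, labels, label_emb, l_pad):
    """Build Q, K, V and the attention mask for the label-enhanced
    transformer layer (illustrative sketch).

    d:         contextualized representations, one vector per word
    labels:    0/1 (literal/metaphoric) labels — golden labels during
               training, early predictions during testing
    label_emb: {0: vec, 1: vec} label embedding table
    l_pad:     padding embedding used for queries, so that a word's
               query never carries its own label information
    """
    Q = [[x + p for x, p in zip(d_i, l_pad)] for d_i in d]
    K = [[x + l for x, l in zip(d_i, label_emb[y])]
         for d_i, y in zip(d, labels)]
    V = [row[:] for row in K]
    n = len(d)
    # mask with zero diagonal: each word ignores itself when attending,
    # which prevents leakage of its own golden label
    mask = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
    return Q, K, V, mask
```

Note that only K and V carry label information; the queries use the shared padding embedding, and the zero diagonal of the mask blocks each word from attending to its own labelled record.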

Training Details
The final training objective of our model consists of two parts: (1) the early prediction ŷ^ep_i and (2) the final prediction ŷ_i, both of which use the cross-entropy loss function:

L_EP = -Σ_{i∈D} w_{y_i} log ŷ^ep_{ic}
L_LE = -Σ_{i∈D} w_{y_i} log ŷ_{ic}

where ŷ^ep_{ic} and ŷ_{ic} are the predicted probabilities for the true label y_i, and w_{y_i} is the loss weight of y_i. D represents the whole dataset. The final loss is defined as the weighted summation of L_EP and L_LE:

L = L_LE + γ L_EP

where γ denotes the weighting parameter.
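The objective can be sketched as follows. The exact form of the weighted summation is an assumption on our part (we take L = L_LE + γ·L_EP), and the function names are ours:

```python
import math

def weighted_ce(probs, gold, class_weight):
    """Weighted cross-entropy over a batch: -sum_i w_{y_i} * log p_i(y_i)."""
    return -sum(class_weight[y] * math.log(p[y])
                for p, y in zip(probs, gold))

def total_loss(ep_probs, le_probs, gold, class_weight, gamma=0.2):
    """Combine the early-prediction loss and the label-enhanced loss.
    Assumes the weighted summation L = L_LE + gamma * L_EP."""
    l_ep = weighted_ce(ep_probs, gold, class_weight)
    l_le = weighted_ce(le_probs, gold, class_weight)
    return l_le + gamma * l_ep
```

The class weight (2 for metaphoric tokens, 1 for literal, per the Setup section) counteracts the class imbalance of the metaphor label.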

Table 2: Comparison with baseline models on the VUA ALL POS and VUA Verbs tasks.

Method                  VUA ALL POS         VUA Verbs
                        P     R     F1      P     R     F1
(Wu et al., 2018)       60.8  70.0  65.1    60.0  76.3  67.1
(Gao et al., 2018)      68.4  59.7  63.8    -     -     -
(Mao et al., 2019)      71.7  60.2  65.5    -     -     -
(Dankers et al., 2020)  -     -     -       -     -     -

Dataset

We do not conduct experiments on the TroFi (Birke and Sarkar, 2006) and MOH-X (Mohammad et al., 2016) datasets, which are commonly used in previous works. This is because neither of these two datasets contains discourse information, and words other than the target word within a sentence are all annotated as literal, which is useless for our model. Nonetheless, we believe that the results on the two tasks of the VUA dataset can well demonstrate the superiority of our model in both ALL POS and Verbs metaphor identification. Table 1 shows the descriptive characteristics of the VUA dataset: the number of texts, sentences, tokens, and the class distribution for the ALL POS and Verbs tasks.

Setup
We try to keep the hyper-parameters consistent with previous works that used BERT for metaphor identification. Our model is trained with a batch size of 16 for 4 epochs using the AdamW optimizer with a linear learning rate scheduler and a warmup period of 10%. The maximum learning rate is 5e-5. We apply dropout with a rate of 0.1. The weight in the loss function is w_{y_i} = 2 if y_i = 1 (metaphor), otherwise w_{y_i} = 1. The λ used in the discourse-level representation is set to 0.8 empirically. The k in the neighbouring sentence representation is 2, the same as in Dankers et al. (2020). The γ used in early prediction is set to 0.2.
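For reference, the hyper-parameters listed above can be collected into a single configuration (the key names are ours):

```python
# Hyper-parameters from the Setup section (key names are illustrative).
CONFIG = {
    "batch_size": 16,
    "epochs": 4,
    "optimizer": "AdamW",
    "lr_schedule": "linear",
    "warmup_ratio": 0.10,
    "max_lr": 5e-5,
    "dropout": 0.1,
    "loss_weight_metaphor": 2,   # w_{y_i} = 2 when y_i = 1, else 1
    "lambda_discourse": 0.8,
    "k_neighbour_sentences": 2,
    "gamma_early_prediction": 0.2,
}
```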

Results and Discussion
We compare our model with existing approaches that do not use external knowledge. We do not compare with works that divided the dataset into train and test sets by themselves, such as Wan and Xing (2020). Since Gao et al. (2018) and Mao et al. (2019) used a different subset of VUA, we use the results reported by Neidlein et al. (2020) on VUA ALL POS and VUA Verbs for comparison. Since the F1 score (71.4) of our BERT baseline is higher than that (70.3) in Dankers et al. (2020) even though the two models are basically the same, we re-implement their method. Our experimental results are obtained by averaging the results of five random runs. Table 2 shows that our model surpasses the highest results by 1.1% and 1.0% on the VUA ALL POS and VUA Verbs tasks, respectively.
The current state-of-the-art model is DeepMet (Su et al., 2020), which takes the global text context, local text context, query word, general POS, and fine-grained POS as input. To make the comparison fairer, we first removed their ensemble module, because simply modifying the hyper-parameters to vote is of little research significance, though it is helpful for performance. Secondly, DeepMet after removing the ensemble part is a 10-fold voting model, so we also adopt this strategy and remove our discourse-level module, because DeepMet divides the training and validation sets at the sentence level, which causes the sentences in the same discourse to be scattered across the training and validation sets. This would lead to incomplete discourse information in the training and validation sets. Finally, we use RoBERTa (Liu et al., 2019) as the baseline model instead of BERT, the same as DeepMet. This variant of our model is marked as Ours_cv. We rerun the code of DeepMet and compare the results, which are shown in Table 3. The F1 scores of our model are 1% and 1.4% higher than DeepMet on ALL POS and Verbs, respectively.
As shown in Table 2 and Table 3, both DeepMet and the proposed model gain more in recall than in precision compared with BERT. In general, advanced pre-trained models such as RoBERTa (Gong et al., 2020) or more semantic information (Dankers et al., 2020) improve recall and worsen precision. Because metaphor is a special (or high-level) way of using language, it is difficult to identify complicated metaphorical expressions when the model cannot fully understand the meaning. Our model introduces contextual metaphorical information to enhance its ability to understand complicated contexts. Meanwhile, by using the global memory network, the model might benefit from another well-understood context that contains the target word when processing the same word in a context that is difficult to understand. DeepMet used RoBERTa and reduced the threshold for classifying a word as a metaphor, making the model inclined to predict words as metaphors. Across genres, our model achieves comparable results against the baselines. It can be seen that our model performs well on the news and academic genres. This is because each discourse in these two genres mainly describes a single event or topic and is internally coherent, so the same metaphor is likely to reappear in the discourse, exhibiting a certain metaphorical consistency. Meanwhile, the label-enhanced representation module enhances the ability to identify metaphorical expressions in the long sentences that are common in these two genres. The improvement obtained on the conversation genre is mainly because our model introduces more discourse information, which is important for understanding sentences in conversations.

Ablation Study
In this experiment, we remove the label-enhanced contextualized representation, the neighbouring sentence representation, and the position-aware global memory network from our model separately; the results are shown in Table 5. The last row in the table, w/o position information, refers to removing the position embedding from the position-aware global memory network. It turns out that each module of our model is useful, and removing any part causes the result to drop.

Influence of Hyper-parameters
Memory size In the position-aware global memory network module, if a word occurs more than T times, we only record its first T occurrences. Figure 3 shows the effect of T on the performance of our model. When T is 10, the result of our model is the best. Since meaningful words rarely appear many times, the performance of our model declines when T is greater than 10, as larger memories may record more meaningless stop words.
The effect of k We use k to control the number of neighbouring sentences. Figure 4 shows that the performance grows with k and becomes stable when k ≥ 2. Considering the time and memory cost, we choose k = 2 in the model.

Number of metaphoric words We also compare the models on sentences with multiple metaphoric words (M>1) and with fewer metaphoric words (M≤1). The results show that our model is 1.9% and 1.5% higher in F1 score than BERT and DeepMet respectively when M>1. This shows the effectiveness of our label-enhanced contextualized representation, because when a sentence contains multiple metaphoric words, it may provide richer contextual metaphor information for the reasoning process. Moreover, we notice that the F1 scores of all models are very low when M≤1. This may be because there are many short sentences, which makes it difficult to understand the meaning of the words in the sentences. This needs further attention in metaphor identification research.

Error Analysis
Although our model introduces wider discourse information and label information, it still has limitations. If a word only appears once in the discourse, the global memory module is invalid. In some cases, it is also difficult to judge the metaphoricity of words even if they appear several times in the discourse. E.g., in the sentences Tyson is not a gambling man (VUA ID: aa3-fragment08-215) and If you were a gambling man it would not affect you (VUA ID: aa3-fragment08-232), where our model fails, the two occurrences of gambling have similar contexts and usage, so it is difficult for our model to make the word benefit from its other occurrence. Moreover, short sentences are also challenging because they provide little contextual label information and semantic information, e.g., No, but getting (VUA ID: kb7-fragment48-13446), where there is not enough information for inference.

Conclusion
In this paper, we propose a hierarchical contextualized representation model to strengthen the model's ability to leverage contextual information. Our model makes use of the contextual metaphor information at the sentence level and the long-range relations of words at the discourse level. We improve the ability of the model to reason about contextual metaphoric relationships and understand the meaning of the context by introducing contextual label representations for the target word. To obtain broader discourse information, we adopt a position-aware global memory network to extract relations among the occurrences of words in a discourse. The results of our model on the two tasks of the VUA dataset surpass the state-of-the-art models that also do not use external knowledge.
In future work, we will explore changing the golden label used in the label embedding stage to the iterative prediction result, which may avoid the deviation caused by the absence of golden labels during testing. Meanwhile, albeit limited, the work of Jang et al. (2017) could provide further direction for this research, such as using words belonging to the same topic/frame/domain instead of only the same words.