Enhanced Metaphor Detection via Incorporation of External Knowledge Based on Linguistic Theories

Use of external knowledge is an important and effective method applied widely in metaphor detection. Although existing knowledge-based methods perform well, they give little consideration to linguistic theories of metaphor when leveraging external knowledge. Based on the Metaphor Identification Procedure (MIP) and Selectional Preference Violation (SPV), and directly using examples and definitions of words from the Oxford Dictionary, we propose two BERT-based models for metaphor detection: ExampleBERT and DefinitionBERT. Experimental results show that our methods achieve state-of-the-art performance on two established metaphor datasets. Furthermore, we show that our DefinitionBERT is highly interpretable.


Introduction
Metaphor Detection (MD) is a high-level natural language processing (NLP) task, which aims to identify the metaphorical expressions/words in the text. Identifying metaphors, a cognitive activity in which humans use their experience in one field to explain or understand another field, is a challenging task that requires rich prior knowledge and a high level of semantic understanding.
In earlier studies, many resources, such as domain types and word abstractness/concreteness, were exploited to develop rule-based and machine learning systems (Turney et al., 2011; Tsvetkov et al., 2014). Recently, many deep learning based methods have been applied to metaphor detection (Kehat and Pustejovsky, 2020; Rohanian et al., 2020), and these achieve the current state-of-the-art performance. They also make use of external knowledge. Hence, we can infer that incorporating external knowledge is indeed important. In this paper, we show that some level of lexical semantic information, even if it is just dictionary entries, can improve performance in identifying verbal metaphors.
A recent study (Mao et al., 2019) shows the effectiveness of taking advantage of linguistic theories when identifying metaphors. According to one of these theories, the Metaphor Identification Procedure (MIP) (Semino et al., 2007; Steen et al., 2010), a metaphor is identified if the literal meaning of a word contrasts with the meaning that the word takes in its context. For example, in the metaphorical sentence 'the deep learning model is flying during training', the contextual meaning of 'flying' is 'the loss of the model is getting bigger and even becoming indefinite', which contrasts with its literal meaning of 'move through the air using wings' according to the Oxford Dictionary. An alternative approach is Selectional Preference Violation (SPV) (Wilks, 1975, 1978), wherein a metaphor is identified by noticing a semantic contrast between a target word and its context. For example, in 'the deep learning model is flying during training', 'fly' is unusual in the context of 'model' and 'training': a model cannot fly.
To incorporate external knowledge, we take advantage of these linguistic theories of metaphor detection. Following SPV, we use examples of the word from the Oxford Dictionary, in which the word is used in its literal meanings most of the time. Hence, some common contextual information of the word can be inferred from the examples. In accordance with MIP, we use the definitions of a word from the Oxford Dictionary, which directly express the literal meanings of the word. To make better use of this knowledge and to conform to the ideas of the linguistic theories, we propose (1) ExampleBERT, which learns the common contextual information of the target word before it identifies metaphors, and (2) DefinitionBERT, which directly takes advantage of the literal meanings of the target word while identifying a metaphor. In particular, our contribution is two-fold:
1. We directly use the examples and definitions of the word from the Oxford Dictionary. To the best of our knowledge, this is the first time this knowledge has been incorporated into metaphor detection.
2. We propose ExampleBERT and DefinitionBERT. Experimental results show that both of our models outperform the state-of-the-art models on two verb metaphor detection datasets. In addition, experimental analysis shows that our DefinitionBERT is indeed effective and highly interpretable.

Related Work
Metaphor identification is a linguistic metaphor processing task that identifies metaphors in textual data. Most of the earlier works on metaphor identification were based on feature engineering. Unigrams, imageability, concreteness, abstractness, word embeddings and semantic classes are features commonly employed by supervised machine learning (Turney et al., 2011; Assaf et al., 2013; Tsvetkov et al., 2014; Klebanov et al., 2016). Recently, many deep learning based methods have been proposed, which treat metaphor identification as a sequence tagging task. According to whether they use external knowledge directly, we divide these methods into the following two categories.
Use of pre-trained word embeddings. The first category of methods uses only pre-trained word embeddings, which are commonly used in NLP tasks. Wu et al. (2018) proposed a model based on word2vec (Mikolov et al., 2013), PoS tags and word clusters, which are encoded by a Convolutional Neural Network (CNN) and a Bi-LSTM. The encoded information is fed directly into a softmax classifier. Gao et al. (2018) and Mao et al. (2019) concatenated GloVe and ELMo (Peters et al., 2018) as the inputs of a Bi-LSTM; the difference is that Mao et al. (2019), inspired by linguistic theories, use an attention mechanism to improve performance.
External knowledge. The second category of methods uses different kinds of external knowledge to boost performance. Kehat and Pustejovsky (2020) use Vision-Language datasets to derive the concreteness scores of words and then convert them to Visibility Embeddings, which, as in Gao et al. (2018), are finally fed to a Bi-LSTM. MUL-GCN is a multi-task learning method that transfers knowledge from Word Sense Disambiguation (WSD); to improve performance, it also employs Graph Convolutional Neural networks (GCN) with dependency trees. Like MUL-GCN, Rohanian et al. (2020) also use GCN, but they incorporate annotations for verbal multiword expressions. Our methods belong to this second category.

BERT
BERT (Devlin et al., 2019) is a powerful language representation model whose architecture is a multilayer bidirectional transformer encoder. The BERT model is pre-trained on a large corpus with two novel unsupervised prediction tasks: masked language modeling and next sentence prediction. Note that BERT is chosen as our base model not only because of its excellent performance on many other NLP tasks, but also because it is a bidirectional language model. More specifically, during training, BERT randomly masks some words in the sentence and then uses all the unmasked words to predict them based on a self-attention mechanism. This procedure allows BERT to learn the common context of the target word, which is very useful for our task: if a target word appears in an uncommon context, BERT is more likely to predict it to be metaphorical.

BERT(Token-CLS)
To apply BERT to our metaphor detection task, we take the final hidden state of the token corresponding to the target word and add a classification layer to predict whether or not the target word is metaphorical. We use this model as our baseline and compare it with ExampleBERT and DefinitionBERT described below.
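The token-level classification step can be illustrated with a minimal sketch. It assumes the encoder output is already available as a list of per-token vectors and that the classification layer is a single logistic unit; the function name `target_probability` and the parameters `weights` and `bias` are hypothetical and not taken from the paper.

```python
import math


def target_probability(hidden_states, target_index, weights, bias):
    """Apply a logistic classification layer to the target token's final hidden state.

    hidden_states: list of per-token vectors (stand-in for the encoder output).
    target_index:  position of the target word in the token sequence.
    weights, bias: parameters of the (assumed) single-unit classification layer.
    """
    h = hidden_states[target_index]
    logit = sum(w * x for w, x in zip(weights, h)) + bias
    # Sigmoid turns the logit into a probability of "metaphorical".
    return 1.0 / (1.0 + math.exp(-logit))
```

In practice the target word may be split into several word pieces; how such cases are handled is not stated in the text, so the sketch assumes a single target position.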

ExampleBERT
The intuition behind SPV is that metaphoricity is identified by detecting the incongruity between a target word and its context. Hence, we assume that a model works more effectively if it has learned the common contextual information of a target word. As described in Section 3.1 above, a bidirectional pre-training model satisfies this requirement. Therefore, our proposed ExampleBERT model is built on the standard BERT architecture (Devlin et al., 2019), which follows the two-stage 'Pre-training'-then-'Fine-tuning' language model approach that has recently become enormously popular in NLP. During the pre-training phase, we collect examples of the target word under its definitions from the Oxford Dictionary and use only MaskLM as our pre-training objective. Here, we continue pre-training from the pre-trained uncased BERT BASE model from Wolf et al. (2020). The training strategy is the same as in Devlin et al. (2019). The training data generator chooses fifteen percent of the token positions at random for prediction. If the i-th token is chosen, we replace it with (1) the [MASK] token eighty percent of the time, (2) a random token ten percent of the time, and (3) the unchanged i-th token ten percent of the time. Our hypothesis is that most of the examples of a target word express its literal meanings. Thus, whether or not the target word itself is selected, the model can learn some common contextual information of the target word. During the fine-tuning phase, we directly fine-tune the pre-trained ExampleBERT on the metaphor detection datasets as described in Section 3.2 above.
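The masking scheme described above can be sketched as follows. This is a simplified illustration, not the actual data generator: it selects each position independently with probability 0.15 (which only approximates choosing fifteen percent of the positions), it does not treat special tokens separately, and the names `mask_tokens`, `vocab` and `select_prob` are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels) following the 15% / 80-10-10 scheme.

    labels[i] holds the original token where position i was selected for
    prediction, and None elsewhere.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not chosen for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

A call such as `mask_tokens(example.split(), vocab, random.Random(0))` would produce one corrupted copy of a dictionary example; passing the random generator explicitly keeps the sketch reproducible.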

DefinitionBERT
Based on MIP, we assume that the model will work more effectively if we tell it the literal meanings of the target word directly. Fortunately, BERT can explicitly model the relationship between a pair of texts, and this has proved beneficial to many pair-wise natural language understanding tasks. Therefore, to fully leverage the definitions of words, we construct a context-definition pair from all possible definitions of the target word in the Oxford Dictionary, thereby treating the MD task as a sentence pair classification problem. Unlike Huang et al. (2019), however, we cannot and do not need to match each definition with the sentence one by one, because the contextual meaning of a metaphorical word differs from all of its definitions. Also, we do not know which definition the contextual meaning of a non-metaphorical word corresponds to. Moreover, although WordNet (Miller, 1995) contains collections of word definitions, we find that they cannot accurately express the literal meaning of words, and some of them are in fact metaphorical meanings. For example, in WordNet, one of the definitions of 'drink' is 'take in liquids'. In the sentence 'car drinks gasoline', that definition does not help us, or a model, identify that 'drink' is metaphorical. In contrast, the Oxford Dictionary definition, 'take (a liquid) into the mouth and swallow', can be of help: a car, which has no mouth and cannot swallow, is clearly unsuitable here.
As shown in Figure 1, we directly concatenate the multiple definitions of the target word and use "[SEP]" to separate them. We then use the context-definitions pair as the input to BERT. After encoding by BERT, we take the final hidden state of the target word as its contextual meaning. To obtain the literal meaning expressed by each definition, we also take the final hidden states of the tokens of each definition and use mean pooling to average them into a single vector per definition. This is formulated as follows:
$$h_{d_i} = f_m(f_b(d_i)) \tag{1}$$
where $f_b$ represents the BERT encoder, $d_i$ is the $i$-th definition, and $f_m : \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ is a mean pooling function that maps the output vectors of $n$ tokens to the definition vector. Then, we concatenate the vectors of the target word and of the $k$ definitions into one vector and apply a Feed-Forward Neural Network (FFNN) over the concatenated representation. This is formulated as:
$$h_f = \mathrm{FFNN}([h_{tt}; h_{d_1}; \dots; h_{d_k}]) \tag{2}$$
where $h_{tt}$ indicates the hidden state of the target word from BERT. Then $h_f$ is taken as input to a logistic regression classifier to make the prediction.
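The input construction and pooling described above can be sketched in plain Python. This is an illustration under several assumptions: tokens are given as ready-made lists (a stand-in for WordPiece tokenization), the hidden states are supplied by the caller rather than computed by BERT, the leading "[CLS]" follows the standard BERT pair format and is not confirmed by the text, every definition is non-empty, and the helper names `build_pair`, `mean_pool` and `build_features` are hypothetical.

```python
def build_pair(context_tokens, definitions):
    """Build the sequence: [CLS] context [SEP] def_1 [SEP] def_2 [SEP] ...

    Also returns, for each definition, the (start, end) span of its tokens
    so that their hidden states can be pooled later (end is exclusive).
    """
    tokens = ["[CLS]"] + list(context_tokens) + ["[SEP]"]
    spans = []
    for definition in definitions:
        start = len(tokens)
        tokens.extend(definition)
        spans.append((start, len(tokens)))
        tokens.append("[SEP]")
    return tokens, spans


def mean_pool(vectors):
    """Mean pooling: average n vectors of size d into one vector of size d."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_features(hidden_states, target_index, spans):
    """Concatenate the target-word state with the pooled definition vectors."""
    features = list(hidden_states[target_index])
    for start, end in spans:
        features.extend(mean_pool(hidden_states[start:end]))
    return features
```

The resulting feature vector is what would be passed to the feed-forward layer; with a hidden size d and k definitions it has (k + 1) * d entries.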

Dataset
To be compatible with previous work (Gao et al., 2018;Mao et al., 2019;Rohanian et al., 2020), we evaluate the proposed models using three widely used datasets for metaphor detection.
VUA ( (Charniak et al., 2000). The TroFi dataset contains 3737 sentences, and the average sentence length is 28.3. Each sentence has a single annotated target verb. There are only fifty unique target verbs in this dataset, which means that there are many training samples for each target verb.

Baselines
RNN-ELMo (Gao et al., 2018). This representative model uses GloVe and ELMo as features for sequential metaphor identification. The ELMo word vectors they trained have been adopted in many subsequent works.

RNN-HG & RNN-MHCA (Mao et al., 2019). These are BiLSTM-based systems grounded in the linguistic theories of SPV and MIP; they are the first to use linguistic theories to directly inform the design of Deep Neural Networks (DNN) for metaphor identification. They use the GloVe and ELMo word embeddings as the literal meaning of a word.
MUL-GCN. This is a multi-task learning model for metaphor detection that, to improve performance, combines graph convolutional neural networks to capture important context words, a control mechanism to emphasize the target words, and the transfer of knowledge from WSD.
BERT+MWE-Aware GCN (Rohanian et al., 2020). This is a neural model that classifies metaphorical verbs in their sentential context using information from the dependency parse tree and annotations for verbal multiword expressions. It is evaluated on the MOH-X and TroFi datasets.

Setup
For pre-training ExampleBERT, we collect about 40,000 examples of the verbs in all three datasets (see Section 4.1). The batch size is 128, the learning rate is 5e-5, and we train for ten epochs. For DefinitionBERT, because different words have different numbers of definitions and batch computation requires a fixed number, we choose the three most common definitions for each word. If a word does not have three definitions, we simply use "no
To pre-train ExampleBERT and to fine-tune DefinitionBERT, we use the pre-trained uncased BERT base model from Wolf et al. (2020) in both cases. It has 12 transformer blocks, 12 self-attention heads, and a hidden size of 768. For the FFNN in Eq. 2 of DefinitionBERT, we simply use a fully connected layer with 256 hidden units, followed by a classification layer. Both models are fine-tuned with shuffled minibatches of size 32. The Adam optimizer is used to update the parameters, and the initial learning rate is set to 5e-5.
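Fixing the number of definitions per word, as described above for DefinitionBERT, can be sketched as below. The function name `fix_definition_count` is hypothetical, and the padding text is deliberately a required argument because the exact placeholder string is not given in full here; the sketch also assumes the definitions are already ordered from most to least common.

```python
def fix_definition_count(definitions, placeholder, k=3):
    """Keep the first k definitions; pad with `placeholder` if fewer exist.

    Assumes `definitions` is ordered from most to least common.
    """
    kept = list(definitions[:k])
    # A non-positive repeat count yields an empty list, so no padding is
    # added when k definitions are already available.
    kept.extend([placeholder] * (k - len(kept)))
    return kept
```

Every word then contributes exactly k definition slots, so examples can be stacked into batches of equal shape.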

Results
Results in terms of accuracy (Acc), precision (P), recall (R) and F1-score are given in Table 1. Scores with the best performances across all models are indicated in bold. Results not reported are indicated by (-). As shown in Table 1, our ExampleBERT and DefinitionBERT achieve state-of-the-art performance on VUA VERB and MOH-X datasets.
VUA VERB dataset. For the VUA VERB dataset, even our BERT-Baseline model achieves excellent performance, improving over the best of the other methods (MUL-GCN) by a large margin: 2.28% and 3.37% on accuracy (Acc) and F1, respectively. Compared with our BERT-Baseline, ExampleBERT and DefinitionBERT improve F1 by 0.36% and 1.02%, respectively.
MOH-X dataset. For the MOH-X dataset, our DefinitionBERT achieves significant improvement across all results compared with both BERT-Baseline and the best of the other models.
TroFi dataset. For the TroFi dataset, however, the performance of our ExampleBERT and DefinitionBERT is somewhat worse than other state-of-the-art results. Compared with our BERT-Baseline, ExampleBERT and DefinitionBERT still show an F1 gain of 0.54% and 1.10%, respectively, indicating that our method remains effective. The TroFi dataset contains fewer samples than the VUA dataset but has a longer average sequence length (28.3). Thus, on the one hand, it is more difficult for DefinitionBERT to capture the relationship between the target word and its definitions. On the other hand, because the dataset contains only fifty unique verbs, there are many samples for each target verb, and most of them express the literal meanings of the word; for example, the dataset contains 71 literal sentences and 25 metaphorical sentences for the target word 'absorb'. Thus, the models can learn sufficient common contextual information and literal meanings of a target word from the dataset itself, and the prior knowledge we add provides only limited help. Nevertheless, compared with other well-designed models, the performance of our models does not lag far behind; therefore, we believe our method is still acceptable. Moreover, the results on the MOH-X and TroFi datasets suggest that our two models are more useful when only a small training corpus is available.
We note that DefinitionBERT performs well on every metric except precision. A possible reason is that, when the model cannot accurately obtain the contextual meaning of the target word (i.e., when the sentence is complex), or when the definitions we get from the dictionary are highly summarized, there is a gap between the contextual meaning and the literal meaning even though they actually express the same meaning. The model is then more likely to predict a literal use as metaphorical, which increases the number of false positives (FP) and lowers precision (precision = TP / (TP + FP)). This suggests a direction for improving our methods in future work.
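The effect of additional false positives on precision can be checked with a small sketch of the four reported metrics. The function name `binary_metrics` and the convention that label 1 means metaphorical are assumptions made for this illustration, and the toy labels are invented purely to exercise the formula.

```python
def binary_metrics(gold, predicted):
    """Accuracy, precision, recall and F1 for binary labels (1 = metaphorical)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when nothing is predicted positive
    # or no positive examples exist.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": accuracy, "p": precision, "r": recall, "f1": f1}
```

Turning one more literal example into a positive prediction leaves recall unchanged but reduces precision, which is the behaviour discussed above.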

Analysis
As described in Section 3.4, if DefinitionBERT works correctly, it should learn, while identifying metaphors, the differences and relationships between the contextual meaning of the target word and the literal meanings expressed by its definitions. Therefore, to understand how DefinitionBERT uses the definitions we provide, we compute the cosine similarity between $h_{tt}$ and each $h_{d_i}$ described in Eq. 2. If a word is predicted as metaphorical, the cosine similarities between the target word and its definitions will be very small, and the definition that expresses the literal sense underlying its contextual meaning will always have the smallest cosine similarity. Conversely, if a word is predicted as non-metaphorical, the values will be larger, and the definition with the greatest cosine similarity will always be the one that expresses its contextual meaning.
A specific example is given in Table 2. In the sentence 'Her husband often abuses alcohol', 'abuses' is metaphorical; its contextual meaning is 'a man drinks too much, resulting in a bad effect'. We therefore infer that this metaphorical meaning is based on the first definition, which has the smallest similarity. In the sentence 'This boss abuses his workers', 'abuses' is non-metaphorical; its contextual meaning is 'speaking in an insulting and offensive way', which clearly corresponds to the third definition, the one with the greatest similarity. That is to say, our DefinitionBERT takes advantage of the definitions during training. The definitions directly help the model distinguish the contextual and literal meanings of the target word, which is exactly our purpose.
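The similarity analysis above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the target-word vector and the definition vectors have already been extracted from the model as plain lists of floats; the function names `cosine_similarity` and `rank_definitions` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_definitions(h_tt, definition_vectors):
    """Return (definition_index, similarity) pairs, most similar first."""
    sims = [(i, cosine_similarity(h_tt, h_d)) for i, h_d in enumerate(definition_vectors)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)
```

For a word predicted as non-metaphorical, the first entry of the ranking would be read as the definition closest to the contextual meaning; for a metaphorical prediction, the last entry is the one of interest.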

Discussion
The main reason for the improvements in our experimental results is that we use external knowledge based on linguistic theories, which is very suitable and effective for detecting metaphors. The ways we incorporate the examples and definitions of a word correspond exactly to the two pre-training objectives of BERT, which are also its advantages. It also seems possible to combine ExampleBERT and DefinitionBERT to attain better performance: we can use the pre-trained ExampleBERT to fine-tune DefinitionBERT. However, our experimental results show that, although the combined model surpasses ExampleBERT, it cannot surpass DefinitionBERT. A possible reason is that, because MaskLM is the only pre-training objective, the ability of the pre-trained ExampleBERT to model the relationship between a pair of texts is weakened.
RNN-HG and RNN-MHCA, proposed by Mao et al. (2019) and also inspired by linguistic theories, focus more on model architectures suited to SPV or MIP, whereas we focus on external knowledge suited to SPV or MIP. Moreover, we believe our ExampleBERT and DefinitionBERT are just base models that can be further improved by other techniques, such as the GCN applied by Rohanian et al. (2020).
Moreover, compared to previous state-of-the-art models, especially knowledge-based methods like that of Rohanian et al. (2020), our DefinitionBERT is highly interpretable while achieving excellent performance. As described in Section 4.5, because our DefinitionBERT locates the intended meaning of the metaphor in context, it helps us further interpret metaphors. One approach to metaphor interpretation is Definition Generation, proposed by Zayed et al. (2020), which aims to find the most probable definition/interpretation (if one exists) of the highlighted expression among the given definitions. Our DefinitionBERT is well suited to this task and dataset. Another approach is Lexical Substitution, explored by Mao et al. (2018), where the metaphoric word/phrase is replaced with its literal counterpart to clarify its semantic meaning. We also believe our DefinitionBERT can be an alternative to the method of Mao et al. (2018).

Conclusion
We proposed two simple but effective methods for metaphor detection, which achieve state-of-the-art performance on two verb metaphor detection datasets. More importantly, we showed that our DefinitionBERT is highly interpretable and can be further applied to metaphor interpretation. For future work, we will explore how to use the external knowledge of words for a sequential task, such as the VUA ALL POS dataset, which is not evaluated in this paper. A simple, crude way is to collect all the examples of the words in the datasets and then continue to use ExampleBERT according to SPV. For MIP, however, combining the definitions of all words into one sentence, as this paper does, seems to be a poor implementation. Moreover, there are several dictionaries (Zayed et al., 2020) that give examples and definitions of words, and the examples or definitions from different dictionaries differ somewhat in type and content, which may lead to different results when combined with our methods. Therefore, to obtain better performance, we will try resources from different dictionaries, on the premise that the definitions of the words must be non-metaphorical.

Table 2: Examples for the word 'abuse' from the MOH-X dataset. 'd-1-s' indicates the cosine similarity between the feature vector of the first definition and the feature vector of the target word extracted from DefinitionBERT.
Abuse (definitions according to the Oxford Dictionary)
d-1: use (something) to bad effect or for a bad purpose; misuse.
d-2: treat with cruelty or violence, especially regularly or repeatedly.
d-3: speak to (someone) in an insulting and offensive way.
Abuse (samples and cosine similarity between the three definitions)
Columns: samples, label, predict, d-1-s, d-2-s, d-3-s
1. Her husband often abuses alcohol.