Verb Metaphor Detection via Contextual Relation Learning

Correct natural language understanding requires computers to distinguish the literal and metaphorical senses of a word. Recent neural models achieve progress on verb metaphor detection by viewing it as sequence labeling. In this paper, we argue that it is appropriate to view this task as relation classification between a verb and its various contexts. We propose the Metaphor-relation BERT (MrBERT) model, which explicitly models the relation between a verb and its grammatical, sentential and semantic contexts. We evaluate our method on the VUA, MOH-X and TroFi datasets. Our method achieves competitive results compared with state-of-the-art approaches.


Introduction
Metaphor is ubiquitous in our daily life for effective communication (Lakoff and Johnson, 1980). Metaphor processing has become an active research topic in natural language processing due to its importance in understanding implied meanings.
This task is challenging, requiring contextual semantic representation and reasoning. Various contexts and linguistic representation techniques have been explored in previous work.
Early methods focused on analyzing restricted forms of linguistic context, such as subject-verb-object grammatical relations, based on hand-crafted features (Shutova and Teufel, 2010b; Tsvetkov et al., 2013; Gutiérrez et al., 2016). Later, word embeddings and neural networks were introduced to alleviate the burden of feature engineering for relation-level metaphor detection (Rei et al., 2017; Mao et al., 2018). However, although grammatical relations provide the most direct clues, other contexts in running text are mostly ignored.
Recently, token-level neural metaphor detection has drawn more attention. Several approaches discovered that wider context can lead to better performance. Do Dinh and Gurevych (2016) considered a fixed window surrounding each target token as context. Gao et al. (2018) and Mao et al. (2018) argued that the full sentential context can provide strong clues for more accurate prediction. Some recent work also attempted to design models motivated by metaphor theories (Mao et al., 2019; Choi et al., 2021).

* These authors contributed equally to this work.
Despite the progress of exploiting sentential context, there are still issues to be addressed. First of all, a word's local context, its sentential context and other contexts should all be important for detecting metaphors; however, they are not well combined in previous work. More importantly, as shown in Figure 1, most token-level metaphor detection methods formulate metaphor detection as either a single-word classification or a sequence labeling problem (Gao et al., 2018). The context information is mainly used for learning contextual representations of tokens, rather than for modeling the interactions between the target word and its contexts (Zayed et al., 2020).
In this paper, we focus on token-level verb metaphor detection, since verb metaphors are the most frequent type of metaphoric expression (Shutova and Teufel, 2010a). As shown in Figure 1, we propose to formulate verb metaphor detection as a relation extraction problem, instead of a token classification or sequence labeling problem. In analogy to identifying the relations between entities, our method models the relations between a target verb and its various contexts, and determines the verb's metaphoricity based on the relation representation rather than only the verb's (contextual) representation.
We present a simple yet effective model, Metaphor-relation BERT (MrBERT), which is adapted from a BERT-based (Devlin et al., 2019) state-of-the-art relation learning model (Baldini Soares et al., 2019).

Figure 1: Three formulations of verb metaphor detection: (a) classification, (b) sequence labeling, (c) relation extraction.

Our model has three highlights, as illustrated in Figure 2. First, we explicitly extract and represent context components, such as a verb's arguments as the local context, the whole sentence as the global context, and its basic meaning as a distant context, so that multiple contexts can be modeled interactively and integrated together. Second, MrBERT models the metaphorical relation between a verb and its context components, and uses the relation representation to determine the metaphoricity of the verb. Third, the model is flexible enough to incorporate sophisticated relation modeling methods and new types of contexts.
We conduct experiments on the largest metaphor detection corpus, the VU Amsterdam Metaphor Corpus (VUA) (Steen, 2010). Our method obtains competitive results on the large VUA dataset. Detailed analysis demonstrates the benefits of integrating various types of contexts for relation classification. The results on relatively small datasets, such as MOH-X and TroFi, also show good performance and model transferability.

Formulating Verb Metaphor Detection
This section briefly summarizes the common formulations of token-level verb metaphor detection as background, and discusses the relation between this paper and previous work.

The task: A given sentence contains a sequence of n tokens x = x_1, ..., x_n, and a target verb in this sentence is x_i. Verb metaphor detection is to judge whether x_i has a literal or a metaphorical sense.

Basic formulations: Most neural network based approaches cast the task as a classification or sequence labeling problem (Do Dinh and Gurevych, 2016; Gao et al., 2018). As shown in Figure 1, the classification paradigm predicts a single binary label to indicate the metaphoricity of the target verb, while the sequence labeling paradigm predicts a sequence of binary labels for all tokens in a sentence.
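To make the contrast concrete, here is a toy illustration (not code from the paper) of the two basic output formats; the helper names are our own:

```python
# Toy illustration of the two basic output formats for a sentence of
# n tokens with target verb at index i (names are hypothetical).
def classification_output(n, i, metaphorical):
    # Classification paradigm: one binary label for the target verb only.
    return int(metaphorical)

def sequence_labeling_output(n, labels):
    # Sequence-labeling paradigm: one binary label per token.
    assert len(labels) == n
    return labels

# "He absorbed the costs" -- 'absorbed' (index 1) used metaphorically.
print(classification_output(4, 1, True))          # -> 1
print(sequence_labeling_output(4, [0, 1, 0, 0]))  # -> [0, 1, 0, 0]
```

The relation-extraction view proposed in this paper keeps the single verb-level decision but bases it on an explicit verb-context relation representation.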
Based on these basic formulations, various approaches have tried to enhance feature representations by using globally trained contextual word embeddings (Gao et al., 2018) or by incorporating wider context with powerful encoders such as BiLSTM (Gao et al., 2018; Mao et al., 2019) and Transformers (Dankers et al., 2019; Su et al., 2020).

Limitations and recent trends: However, the above two paradigms have some limitations.
First, contextual information is mostly used to enhance the representation of the target word, but the interactions between the target word and its contexts are not explicitly modeled (Zayed et al., 2020; Su et al., 2020). To alleviate this, Su et al. (2020) proposed a new paradigm that views metaphor detection as a reading comprehension problem, which uses the target word as a query and captures its interactions with the sentence and clause. A concurrent work (Choi et al., 2021) adopted a late interaction mechanism over a pre-trained contextualized model to compare the basic meaning and the contextual meaning of a word.
Second, exploiting wider context will bring in more noise and may lose focus. Fully depending on data-driven models to discover useful contexts is difficult, given that the scale of available datasets for metaphor detection is still limited. Grammatical structures, such as verb arguments, are important for metaphor processing (Wilks, 1978), but are not well incorporated into neural models. Stowe et al. (2019) showed that data augmentation based on syntactic patterns can enhance a standard model. Another approach adopted graph convolutional networks to incorporate dependency graphs, but did not consider specific grammatical relations. It is interesting to further explore how to integrate explicit linguistic structures for contextual modeling.

Figure 2: An example showing MrBERT's main architecture. MrBERT considers the representations of (1) the sentential global context, (2) the grammatical local context, and (3) the basic meaning of the verb as a distant context. Three context integration strategies for modeling contextual relations are adopted: (a) context concatenation, (b) context average, and (c) context maxout. The contextual relation r is modeled to indicate the probability of being metaphorical, where linear, bilinear and neural tensor models can be applied to capture interactions between the verb and its contexts. The relation-level and sequence-level predictions are jointly optimized.
This paper presents a new paradigm for verb metaphor detection to overcome these limitations, by viewing the task as a relation extraction task. We assume a target verb and its multiple contexts are entities, and metaphor detection is to determine whether a metaphorical relation holds between the verb and its contexts.
We will introduce the proposed model in Section 3. Before diving into details, we argue that viewing metaphor as a relation is reasonable and consistent with existing metaphor theories. According to Wilks (1978), metaphors show a violation of selectional preferences in a given context. The conceptual metaphor theory views metaphors as transferring knowledge from a familiar, or concrete, domain to an unfamiliar, or more abstract, domain (Lakoff and Johnson, 1980; Turney et al., 2011). The metaphor identification procedure (MIP) theory (Group, 2007) aims to identify metaphorically used words in discourse by comparing their use in a particular context with their basic meanings. All these theories concern a kind of relation between a target word and its contexts, which may help identify metaphors.

Metaphor-Relation BERT (MrBERT)
We propose the Metaphor-relation BERT (MrBERT) model to realize verb metaphor detection as a relation classification task. Figure 2 shows the architecture of MrBERT. We use the pre-trained language model BERT as the backbone. There are three main procedures: (1) extract and represent contexts; (2) model the contextual relations between the target verb and its contexts; (3) manipulate the contextual relations to predict the verb's metaphoricity.

Types of Contexts
A metaphor can result when a target word interacts with a certain part in a sentence. Previous work often explored individual context types, such as verb arguments through grammatical relations or the whole sentence/clause. Little work has attempted to summarize and combine different contexts.
We summarize the following contexts, which can help determine a verb's metaphoricity:

• Global context: We view the whole sentence as the global context. A metaphorically used word may seem divergent from the meaning or topic of the sentence.
• Local context: We view the words that have a close grammatical relation to the target word as the local context, which is widely studied to capture selectional preference violations.

• Distant context: Motivated by the MIP theory, the difference between the contextual usage of a word and its basic meaning may indicate a metaphor, so we view the basic meaning of the target verb as a distant context.
Then, we have to extract and represent these contexts.

Context Extraction and Representation
We call the target verb's contexts context components. To get the contextual or basic meanings of these components, we use deep transformer models, such as BERT.
We first use the Stanford dependency parser (Chen and Manning, 2014) to extract the subject and object of the target verb, and then insert paired markers, such as [subj] and [/subj], [obj] and [/obj], to explicitly label the boundaries of the target verb, its subject and object in each sentence. We also use [CLS] and [SEP] to mark the whole sentence. For example, the marker-inserted token sequence for the sentence "He absorbed the costs for the accident" is shown in Figure 2. The whole token sequence is fed into BERT's tokenizer, and then the transformer layers.
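A minimal sketch of such marker insertion is shown below. The helper function and the [verb] marker name are our assumptions for illustration; [subj], [/subj], [obj], [/obj], [CLS] and [SEP] follow the text:

```python
# Hypothetical sketch of marker insertion (Section 3.1.2): wrap the
# target verb, its subject and its object with paired boundary markers
# before feeding the sequence to BERT's tokenizer.
def insert_markers(tokens, verb_i, subj_i=None, obj_i=None):
    spans = [("verb", verb_i)]
    if subj_i is not None:
        spans.append(("subj", subj_i))
    if obj_i is not None:
        spans.append(("obj", obj_i))
    out = list(tokens)
    # Insert from right to left so earlier indices stay valid.
    for name, i in sorted(spans, key=lambda s: -s[1]):
        out[i:i + 1] = [f"[{name}]", tokens[i], f"[/{name}]"]
    return ["[CLS]"] + out + ["[SEP]"]

sent = "He absorbed the costs for the accident".split()
print(insert_markers(sent, verb_i=1, subj_i=0, obj_i=3))
```

With incomplete arguments (e.g., an intransitive verb), the corresponding markers are simply omitted, matching the zero-vector fallback described in Section 4.1.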
To get the contextual representations, we use the hidden states of the final transformer layer. For each marked component, we use the start marker (e.g., [subj]) or the averaged embedding between the start and the end markers (e.g., [subj] and [/subj]) as the component representation.
The contextual representation of the whole sentence is read from the final hidden state of [CLS].
To represent the basic meaning of the verb, we use the output from the BERT tokenizer to get a context-independent verb representation. If word pieces exist, their averaged embedding is used.
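The word-piece averaging step can be sketched as follows (illustrative only; toy numpy vectors stand in for BERT's word-piece embeddings):

```python
import numpy as np

# Sketch: if the verb splits into several word pieces, average their
# (context-independent) embeddings to get one vector for its basic meaning.
def basic_meaning(piece_embeddings):
    return np.mean(np.stack(piece_embeddings), axis=0)

pieces = [np.array([1.0, 3.0]), np.array([3.0, 5.0])]  # e.g. "ab" + "##sorbed"
print(basic_meaning(pieces))  # -> [2. 4.]
```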

Modeling the Contextual Relation
The relation between the target verb and one of its contexts is called a contextual relation. Our purpose is to utilize the contextual relation(s) to determine the metaphoricity of the verb.
The representations of the verb and a context component are denoted as v ∈ R d and c ∈ R k , respectively. We adopt three ways to explicitly define the form of the relation r for capturing the interactions between v and c.
• Linear model: We use a parameter vector V_r ∈ R^(d+k) and a bias b_r to represent the relation r, and the probability of the relation being metaphorical is computed as

p(r | v, c) = σ(V_r^T (v ⊕ c) + b_r),    (1)

where σ is the sigmoid function and ⊕ denotes concatenation.
• Bilinear model: We use a parameter matrix A_r ∈ R^(d×k) and a bias b_r to represent the relation r:

p(r | v, c) = σ(v^T A_r c + b_r).    (2)

The components and the relation can interact more sufficiently with each other in this way.
• Neural tensor model: We also exploit a simplified neural tensor model for relation representation, which combines the bilinear and linear terms:

p(r | v, c) = σ(v^T A_r c + V_r^T (v ⊕ c) + b_r).    (3)
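The three relation models can be sketched as follows; this is a toy numpy illustration with random vectors standing in for the learned representations, with shapes following the definitions above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, k = 4, 4
rng = np.random.default_rng(0)
v, c = rng.normal(size=d), rng.normal(size=k)  # verb and context vectors

# Linear model: score the concatenated pair with a weight vector.
V_r, b_r = rng.normal(size=d + k), 0.1
p_linear = sigmoid(V_r @ np.concatenate([v, c]) + b_r)

# Bilinear model: v and c interact through a d-by-k matrix.
A_r = rng.normal(size=(d, k))
p_bilinear = sigmoid(v @ A_r @ c + b_r)

# Simplified neural tensor model: bilinear plus linear terms combined.
p_ntn = sigmoid(v @ A_r @ c + V_r @ np.concatenate([v, c]) + b_r)

for p in (p_linear, p_bilinear, p_ntn):
    assert 0.0 < p < 1.0  # each score is a valid probability
```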

Integrating Contextual Relations for Prediction
We focus on three types of contextual relations:

• Verb-global relation: the relation between the contextual representations of the verb v and the whole sentence c_CLS.

• Verb-local relation: the relation between the contextual representations of the verb v and its subject c_subj or object c_obj.

• Verb-distant relation: the relation between the verb v and its basic meaning v_bsc.
The representations of c_subj, c_obj, c_CLS and v_bsc can be obtained as described in Section 3.1.2. We try three ways to integrate the contextual relations. The first two build a combined context c first:

• Context concatenation: We concatenate the representations of the context components as the combined context, i.e., c = c_subj ⊕ c_obj ⊕ c_CLS ⊕ v_bsc.

• Context average: Similarly, we use the averaged representation of all context components as the combined context, i.e., c = average(c_subj, c_obj, c_CLS, v_bsc).

Then we compute the probability that the relation is metaphorical, i.e., p(r | v, c), where the linear, bilinear or neural tensor model can be applied. The remaining way chooses the most confident single prediction:

• Context maxout: The prediction is based on max{p(r | v, c)}, where c ranges over {c_CLS, c_subj, c_obj, v_bsc}.
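The three integration strategies can be sketched as follows; this is a toy numpy illustration in which random vectors stand in for the BERT representations and a simple dot-product scorer stands in for the relation models of Section 3.2:

```python
import numpy as np

# Sketch of the three context-integration strategies (Section 3.3).
rng = np.random.default_rng(1)
c_subj, c_obj, c_cls, v_bsc = (rng.normal(size=4) for _ in range(4))

# (a) concatenation: one long combined context vector
c_concat = np.concatenate([c_subj, c_obj, c_cls, v_bsc])

# (b) average: element-wise mean of the component vectors
c_avg = np.mean([c_subj, c_obj, c_cls, v_bsc], axis=0)

# (c) maxout: score each context separately, keep the most confident
def p_metaphor(v, c):  # toy scorer standing in for Eq. (1)-(3)
    return 1.0 / (1.0 + np.exp(-(v @ c)))

v = rng.normal(size=4)
p_max = max(p_metaphor(v, c) for c in (c_subj, c_obj, c_cls, v_bsc))

assert c_concat.shape == (16,) and c_avg.shape == (4,)
assert 0.0 < p_max < 1.0
```

Note that concatenation quadruples the context dimensionality, which is one plausible reason (discussed in Section 4.3) why it is harder to train than averaging.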
To train the relation-level prediction model, we use binary cross-entropy as the loss function:

L_0 = -(1/N) Σ_{i=1}^{N} [ŷ_i log y_i + (1 - ŷ_i) log(1 - y_i)],

where N is the number of training samples; ŷ_i is the gold label of a verb, with ŷ_i = 1 indicating a metaphorical usage and ŷ_i = 0 indicating a literal usage; and y_i is the probability of being metaphorical predicted by our model.

We further combine relation-level and sequence-level metaphor detection via multi-task learning. The sequence metaphor detection uses the hidden states of the final layer and a softmax layer to predict the metaphoricity of each token. We use cross-entropy as the loss function and denote the average loss over tokens in training samples as L_1. The final loss of MrBERT is L = L_0 + L_1.

Experiments

Datasets: We mainly evaluate our method on the VUA (Steen, 2010) dataset. It is the largest publicly available metaphor detection dataset and has been used in metaphor detection shared tasks (Leong et al., 2018). This dataset has a training set and a test set. Previous work utilized the training set in different ways (Neidlein et al., 2020). We use the preprocessed version of the VUA dataset provided by Gao et al. (2018). The first reason is that this dataset has a fixed development set so that different methods can adopt the same model selection strategy. The second reason is that several recent important methods used the same dataset (Gao et al., 2018; Mao et al., 2018, 2019; Stowe et al., 2019), so it is convenient for us to compare the proposed method with previous work. There are two tracks: Verb and All-POS metaphor detection. In the test set, 50,175 and 5,873 tokens are used for evaluating the All-POS and Verb tracks, respectively. Some basic statistics of the dataset are shown in Table 1. We focus on the Verb track since we mainly model metaphorical relations for verbs. We use MrBERT's relation-level predictions for the Verb track and its sequence labeling module for the All-POS track.

MOH-X (Mohammad et al., 2016) and TroFi (Birke and Sarkar, 2006) are two relatively smaller datasets compared with VUA. Only a single target verb is annotated in each sentence. We report the results on MOH-X and TroFi in three settings: zero-shot transfer, re-training and fine-tuning.

Metrics: The evaluation metrics are accuracy (Acc), precision (P), recall (R) and F1-score (F1), which are most commonly used in previous work.

Baselines
We compare with the following approaches:

• Le et al. (2020) propose a multi-task learning approach with graph convolutional neural networks and use word sense disambiguation as an auxiliary task.

Notice that the systems participating in the VUA metaphor detection shared tasks (Leong et al., 2018) can use any way to manipulate the training set for model selection and ensemble learning, so the reported results in the task report are not directly comparable to ours. The results of DeepMet and MelBERT are based on the single-model evaluation in (Choi et al., 2021). The first four baselines do not utilize pre-trained language models, while the last three baselines use BERT or RoBERTa. These baselines support comprehensive comparisons from multiple aspects.

Parameter Configuration
During context component extraction, if the target verb does not have a subject or an object, we use a fixed zero vector instead. We use the bert-base-uncased model and the standard tokenizer. The values of the hyper-parameters are shown in Table 2.
For MrBERT, we view the ways of component representation (start marker or averaged embedding, see Section 3.1.2), relation modeling (linear, bilinear, and neural tensor (NT) models, see Section 3.2) and context integration (context concatenation, average and maxout, see Section 3.3) strategies as hyper-parameters as well. We run each model for 10 epochs, and choose the best combination according to the performance on the development set. The best combination uses the averaged embeddings, the bilinear model and the context average strategy, and it represents MrBERT in the performance reports in Section 4.2. Table 3 shows the results of the baselines and MrBERT. Except for (Gao et al., 2018)-CLS, all methods use the annotation information of all tokens. For the All-POS track, we report the performance on either all POS tags or 4 main POS tags for comparison with previous work.
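Treating the three strategies as hyper-parameters amounts to a small grid search; a toy sketch is below, where the scoring function is hypothetical and stands in for real dev-set evaluation:

```python
from itertools import product

# Illustrative grid search over MrBERT's strategy hyper-parameters
# (Section 4.1); dev_f1 is a stand-in for real dev-set evaluation.
representations = ["start_marker", "averaged_embedding"]
relation_models = ["linear", "bilinear", "neural_tensor"]
integrations = ["concat", "average", "maxout"]

def dev_f1(combo):  # hypothetical scorer; returns a toy dev F1
    # Pretend the paper's best combination scores highest.
    best = ("averaged_embedding", "bilinear", "average")
    return 0.8 if combo == best else 0.7

best = max(product(representations, relation_models, integrations), key=dev_f1)
print(best)  # -> ('averaged_embedding', 'bilinear', 'average')
```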

Main Results on VUA Dataset
We can see that MrBERT achieves superior or competitive performance compared with previous work on verb metaphor detection. The use of pre-trained language models improves the performance in general, compared with several LSTM based methods. Recently proposed models, such as DeepMet, MelBERT and MrBERT, gain further improvements compared with BERT-SEQ.
MrBERT outperforms Stowe et al. (2019) and the graph convolutional baseline by a large margin. These two baselines attempt to make use of grammatical information, through data augmentation or graph neural networks. In contrast, MrBERT provides a simple yet effective way to incorporate verb arguments and new contexts into a pre-trained language model.
MrBERT also achieves competitive performance compared with DeepMet and MelBERT. We share the similar idea of enhancing interactions between the target verb and its contexts, but implement it in different ways. DeepMet and MelBERT are based on the pre-trained model RoBERTa and use additional POS or FGPOS information. Moreover, these two models are trained on every token, so their training might be more sufficient. In contrast, we mainly model metaphorical relations for verbs. This is perhaps also the reason that on the All-POS metaphor detection track, MrBERT has slightly worse results than MelBERT. However, our model is flexible and can be applied to tokens with other POS tags as well. We leave this as future work.

Analysis
We further analyze the effects of modeling contextual relations from several aspects.

Relation modeling and context integration strategies: Table 4 shows the results of different combinations of relation modeling and context integration strategies. BERT-SEQ here refers to the re-trained baseline with model selection based on the performance on the development set, which surpasses the reported results in (Neidlein et al., 2020). We can see that most combinations outperform BERT-SEQ and have consistent performance. The bilinear and neural tensor models perform better than the linear model, which means that sophisticated relation modeling techniques can benefit the performance.
Context average and context maxout perform better than context concatenation. The reason may be that context concatenation is harder to train due to its larger number of parameters.

Table 5 shows the performance of MrBERT when it considers the global context only (MrBERT-G), the global and local contexts (MrBERT-GL), and the full model with the distant context (MrBERT-GLD). Each model is trained separately, with the same model selection procedure. We can see that integrating multiple contexts leads to better performance. MrBERT explicitly incorporates verb arguments through grammatical relations as the local context, which differs from other methods. We are interested in the effect of such information.

Effects of different contexts
We analyze MrBERT-G and MrBERT-GL. Table 6 shows the distribution of automatically extracted verb-subject and verb-direct object relations in the VUA test dataset. The ΔF1 values indicate the improvements of MrBERT-G over BERT-SEQ in F1. We can see that MrBERT-G outperforms BERT-SEQ mainly when the verb's arguments are incomplete; for verbs with complete verb-subject and verb-direct object structures, little improvement is gained. Table 7 shows the corresponding performance of MrBERT-GL. Better performance is obtained for verbs with all statuses of grammatical relations, and the improvement on verbs in the lower right corner is obvious. In these cases, the verbs are usually intransitive verbs or used as a noun or an adjective. The benefit of involving grammatical relations may be that it helps keep a dynamic and balanced focus between the global and local contexts according to the signals expressed by the grammatical structure.
Intuitively, the effect of incorporating grammatical relations should be more obvious for metaphor detection in long sentences, since the local and global contexts are quite different. To verify this, we divide the sentences in the test dataset into bins according to the number of clauses. Figure 3 confirms our hypothesis: MrBERT obtains larger improvements on sentences with more clauses, indicating that incorporating grammatical relations can help filter noisy information.

Finally, the use of the distant context obtains a further improvement. This observation is consistent with the conclusion of Choi et al. (2021). It also indicates that the BERT tokenizer's embedding can be used to approximate the representation of the target verb's basic meaning. Table 8 shows the results on the MOH-X and TroFi datasets.

Results on MOH-X and TroFi Datasets
In the zero-shot transfer setting, MrBERT obtains better performance than DeepMet and MelBERT on both datasets; the performance of DeepMet and MelBERT is read from (Choi et al., 2021). In the 10-fold cross-validation setting, the re-trained MrBERT can also obtain superior or competitive results compared with previous work. If we continue to fine-tune the pre-trained MrBERT on the target datasets, better performance can be obtained, especially on the MOH-X dataset.

Related Work
Metaphor detection is a key task in metaphor processing (Veale et al., 2016). It is typically viewed as a classification problem. The early methods were based on rules (Fass, 1991;Narayanan, 1997), while most recent methods are data-driven. Next, we summarize data-driven methods from the perspective of context types that have been explored.
Grammatical relation-level detection: This line of work determines the metaphoricity of a given grammatical relation, such as verb-subject, verb-direct object or adjective-noun relations. The key to this category of work is to represent semantics and capture the relation between the arguments.
Feature-based methods are based on hand-crafted linguistic features. Shutova and Teufel (2010b) proposed to cluster nouns and verbs to construct semantic domains. Turney et al. (2011) and Shutova and Sun (2013) considered the abstractness of concepts and context. Mohler et al. (2013) exploited Wikipedia and WordNet to build domain signatures. Tsvetkov et al. (2014) combined abstractness, imageability, supersenses, and cross-lingual features. Other work exploited attribute-based concept representations.
The above handcrafted features heavily rely on linguistic resources and expertise. Recently, distributed representations have been exploited for grammatical relation-level metaphor detection. Distributed word embeddings were used as features (Tsvetkov et al., 2014) or to measure semantic relatedness (Gutiérrez et al., 2016; Mao et al., 2018). Visual distributed representations were also shown to be useful. Rei et al. (2017) designed a supervised similarity network to capture interactions between words. Song et al. (2020) modeled metaphors as attribute-dependent domain mappings and presented a knowledge graph embedding approach for modeling nominal metaphors. Zayed et al. (2020) identified verb-noun and adjective-noun phrasal metaphoric expressions by modeling phrase representations as a context.
Token-level detection Another line of work formulates metaphor detection as a single token classification or sequence labeling problem (Do Dinh and Gurevych, 2016;Gao et al., 2018;Mao et al., 2019). These approaches are mostly based on neural network architectures and learn representations in an end-to-end fashion. These approaches depend on token-level human annotated datasets, such as the widely used VUA dataset (Steen, 2010).
BiLSTM plus pre-trained word embeddings is one of the popular architectures for this task (Gao et al., 2018; Mao et al., 2019). Recently, Transformer-based pre-trained language models have become the most popular architecture in the metaphor detection shared task. Multi-task learning (Dankers et al., 2019; Rohanian et al., 2020) and discourse context (Dankers et al., 2020) have been exploited as well.

Discussion: Grammatical relation-level and token-level metaphor detection consider different aspects of information. Grammatical relations incorporate syntactic structures, which are well studied in selectional preferences (Wilks, 1975, 1978) and provide important clues for metaphor detection; however, the sentential context, which is also useful, is ignored. In contrast, token-level metaphor detection explores wider context and gains improvements, but syntactic information is neglected and, as discussed in (Zayed et al., 2020), interactions between metaphor components are not explicitly modeled.
This paper aims to combine the grammatical relation-level, token-level and semantic-level information through pre-trained language model based contextual relation modeling.

Conclusion
This paper presented the Metaphor-relation BERT (MrBERT) model for verb metaphor detection. We proposed a new view that formulates the task as modeling the metaphorical relation between the target verb and its multiple context components, i.e., contextual relations. We proposed and evaluated various ways to extract, model and integrate contextual relations for metaphoricity prediction. We conducted comprehensive experiments on the VUA dataset. The evaluation shows that MrBERT achieves superior or competitive performance compared with previous methods. We also observe that incorporating grammatical relations can help balance local and global contexts, and that using the basic meaning of the verb as a distant context is effective. Further experiments on the small MOH-X and TroFi datasets also show the good transferability of MrBERT.