CATE: A Contrastive Pre-trained Model for Metaphor Detection with Semi-supervised Learning

Metaphors are ubiquitous in natural language, and detecting them requires contextual reasoning about whether a semantic incongruence actually exists. Most existing work addresses this problem using pre-trained contextualized models. Despite their success, these models require a large amount of labeled data and are not grounded in linguistic theory. In this paper, we propose a ContrAstive pre-Trained modEl (CATE) for metaphor detection with semi-supervised learning. Our model first uses a pre-trained model to obtain contextual representations of target words and then employs a contrastive objective, motivated by linguistic theory, to enlarge the distance between target words' literal and metaphorical senses. Furthermore, we propose a simple strategy to collect large-scale candidate instances from a general corpus and generalize the model via self-training. Extensive experiments show that CATE outperforms state-of-the-art baselines on several benchmark datasets.


Introduction
Conceptual metaphors are figurative language widely used in our daily communication, implying a mapping between two conceptual domains (Lakoff and Johnson, 2008). At the linguistic level, a metaphor is defined as a linguistic expression representing concepts other than the literal meanings of its words in context (Lagerwerf and Meijers, 2008). For instance, in the sentence "I have digested all this information," the word digested does not literally mean converting food into absorbable substances; instead, in this context it means "arrange and integrate in the mind." This metaphor conceptualizes the concept of ideas in terms of the properties of food. Such metaphorical associations are broad generalizations that allow us to project knowledge and inferences across domains and are beneficial for various downstream NLP applications, such as machine translation (Shi et al., 2014), sentiment analysis (Cambria et al., 2017; Dankers et al., 2019), and dialogue systems (Dybala and Sayama, 2012).
Given the prevalence of metaphors in human communication, detecting them effectively plays an essential role in natural language understanding. Hence, many efforts have been devoted to metaphor detection (MD), which aims to identify metaphorical expressions in text automatically. Most previous methods for MD (Mason, 2004; Turney et al., 2011; Tsvetkov et al., 2014) are based on various hand-crafted linguistic features and rely on manually annotated resources to extract them. Recently, significant progress has been made in applying deep learning techniques to MD (Wu et al., 2018; Gao et al., 2018; Mao et al., 2019; Rohanian et al., 2020; Le et al., 2020). These methods directly embed textual semantic information into a low-dimensional space with deep neural networks. Nevertheless, they are unable to model the multiple meanings of polysemous words in context (Choi et al., 2021). With the rapid development of contextualized representations, a number of methods (Su et al., 2020; Choi et al., 2021) adopt pre-trained language models to effectively capture context-dependent information about the target words and fine-tune them to obtain state-of-the-art performance for MD.
Although these pre-trained models have achieved promising results, several problems remain unsolved. First, current models do not discriminate between the literal and non-literal meanings of target words, a distinction that can be enhanced by analogical comparison in the specific context based on the Metaphor Identification Procedure (MIP) (Pragglejaz Group, 2007). Second, one challenge for fine-tuned language models is that they still require large amounts of labeled data to obtain state-of-the-art performance on downstream tasks (Du et al., 2020; Yu et al., 2020; Karamanolakis et al., 2021). However, because labeling is expensive and labor-intensive, existing public MD datasets are relatively small. In addition, labeling metaphorical words can be influenced by subjective judgment and may require expert knowledge (Tsvetkov et al., 2014), which poses a significant challenge for metaphor detection.
The above challenges motivate us to propose a ContrAstive pre-Trained modEl (CATE) for metaphor detection, which uses a contrastive objective to model the distance between a target word's literal and metaphorical senses and enhances generalization via self-training on unlabeled data generated by a simple strategy. First, we utilize pre-trained models (i.e., BERT and RoBERTa) to capture contextual information about a target word in the sentence. If the target word is a metaphor, its semantic meaning is context-specific and differs from its literal meaning, which can be described through non-metaphorical instances. Therefore, we incorporate a contrastive objective to make the contextual representations of the literal and metaphorical meanings of a target word more distinguishable, so that the classifier can make a more informed decision. To address the label scarcity issue, we propose a simple target-based generating strategy to automatically generate training data, inspired by the distantly supervised paradigm (Mintz et al., 2009; Hoffmann et al., 2011). Concretely, if a given word serves as the detection target in a sentence, all sentences containing this word in a specific corpus are retrieved and regarded as candidate instances. To expand the training data, we use the pre-trained model, first fine-tuned on the original training set, to generate pseudo-labels for these candidate instances and incorporate them into the training data, as shown in Figure 1. We update the pseudo-labels and the model iteratively by self-training to improve generalization.
In summary, the contributions of this paper are as follows: (1) We propose a novel pre-trained model with a contrastive objective for capturing the semantic incongruence in metaphors based on the MIP linguistic theory. (2) To the best of our knowledge, this is the first attempt to apply semi-supervised learning with self-training to alleviate the label scarcity issue for MD. (3) Empirically, we perform experiments on widely used datasets to verify the effectiveness of our approach. Experimental results show that our approach obtains state-of-the-art performance on several benchmark datasets.

Related Work
Early approaches mainly use a variety of linguistic features to detect metaphors, such as part of speech, unigrams (Klebanov et al., 2014), concreteness/abstractness (Turney et al., 2011; Tsvetkov et al., 2014), WordNet supersenses (Klebanov et al., 2016), and sensory features (Tekiroglu et al., 2015). They rely heavily on carefully designed feature engineering. In recent years, end-to-end neural architectures have been widely applied to MD. Wu et al. (2018) reformulate the MD task as a sequence labeling problem and combine CNN and LSTM layers with ensemble learning, achieving the best performance in the NAACL-2018 metaphor shared task (Leong et al., 2018). Subsequently, Gao et al. (2018) presented a simple BiLSTM augmented with contextualized word representations, which achieved better results. Mao et al. (2019) further adopted two linguistic theories on top of the structure of Gao et al. (2018). In addition, some approaches employ multi-task learning to transfer knowledge from related tasks and resources to improve MD performance (Do Dinh et al., 2018; Dankers et al., 2019; Rohanian et al., 2020; Le et al., 2020). These neural models are capable of capturing the relations between metaphors and their contexts without linguistic analyses. However, their superficial structures make it difficult to represent different aspects of words in context.

Figure 1: The diagram of the CATE model with two stages. In stage I, the proposed pre-trained model is fine-tuned with labeled data using a contrastive objective. In stage II, we design a target-based generating strategy (TGS) to collect unlabeled data and adopt self-training to iteratively augment the training data by generating pseudo-labels.
Recently proposed pre-trained language models (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019) have shown dramatic improvements on several NLP tasks with appropriate fine-tuning. Therefore, some efforts (Maudslay et al., 2020; Gong et al., 2020; Su et al., 2020; Choi et al., 2021) leverage the strong expressive power of pre-trained models such as BERT and RoBERTa to effectively capture general semantics and context-dependent information about target words, improving metaphor detection performance. Despite their success, one bottleneck for fine-tuning pre-trained models is the requirement for labeled data. When labeled data are scarce, the fine-tuned models often suffer from degraded performance, and their large number of parameters can lead to severe overfitting (Xie et al., 2019; Du et al., 2020; Yu et al., 2020). However, it is time-consuming and labor-intensive to manually annotate large-scale training data for MD.

Proposed Method
The MD task is to predict whether a target word in a given sentence is metaphorical or literal. Some previous work (Wu et al., 2018; Gao et al., 2018; Mao et al., 2019) regards metaphor detection as a sequence labeling task that predicts the metaphoricity of each word in a given sentence. However, this format introduces noise by treating all non-target words as literal, which hinders the model from learning the difference between literal and metaphorical words (Mao et al., 2019). In this paper, we cast the MD task as a classification task based on the target word, as in Le et al. (2020) and Choi et al. (2021). Formally, given a sentence S = {w_1, w_2, ..., w_n} with n words and a target word w_t ∈ S, the task is to predict a binary label l_t ∈ {0, 1} indicating the metaphoricity (i.e., metaphorical or literal) of the target word w_t. Figure 2 gives an overview of CATE.
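As a concrete illustration of this formulation, a single training instance pairs a tokenized sentence with a target index and a binary label (a minimal sketch; the field names are our own, not from the paper):

```python
# One MD instance: a tokenized sentence, the index of the target word w_t,
# and a binary metaphoricity label l_t (1 = metaphorical, 0 = literal).
example = {
    "sentence": ["I", "have", "digested", "all", "this", "information"],
    "target_index": 2,   # the target word w_t = "digested"
    "label": 1,          # metaphorical in this context
}

target_word = example["sentence"][example["target_index"]]
```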

Pre-trained Model for MD
Given a sentence S with target word w_t, our model leverages BERT as a sentence encoder, which is particularly attractive for this task due to its strong capacity to capture general semantics and contextual information. Following Devlin et al. (2018), we insert two special tokens, '[CLS]' and '[SEP]', at the beginning and end of the input sentence, respectively. We feed the sentence S with the two special tokens into the BERT backbone to obtain the final hidden states H:

H = BERT([CLS], w_1, ..., w_n, [SEP])  (1)

Our goal is to identify whether the semantic meaning of the target word w_t within the sentence S is metaphorical, so we compute a context-specific representation of w_t for classification. Pre-trained models such as BERT usually employ WordPiece tokenization (Wu et al., 2016; Radford et al., 2019) to reduce the vocabulary size, so a word may be split into multiple word pieces. For example, the word digested is segmented into the two pieces "digest" and "##ed". Hence, we use an averaging operation to obtain a fixed-size feature vector. Assuming the hidden states corresponding to the subwords of the target word w_t are h_i to h_j, we average these hidden states:

c = (1 / (j - i + 1)) Σ_{k=i}^{j} h_k  (2)

where c is the contextualized feature of the target word w_t. We then feed c into an MLP layer with a tanh activation function and a softmax layer to predict the metaphoricity of the target word w_t:

p = softmax(W_2 tanh(W_1 c + b_1) + b_2)  (3)

The parameters are updated by minimizing the cross-entropy loss between the true label y and the metaphoricity distribution p:

L_cls = -(1/M) Σ_{m=1}^{M} y_m log p_m  (4)

where M is the number of instances in the dataset.
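The classification head described above can be sketched as follows. This is our own minimal NumPy illustration, not the authors' implementation: we assume the encoder's final hidden states are already given as a matrix H, and the MLP weights are hypothetical random placeholders.

```python
import numpy as np

def target_feature(H, i, j):
    """Eq. (2): average the hidden states h_i..h_j of the target word's subwords."""
    return H[i:j + 1].mean(axis=0)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def predict_metaphoricity(c, W1, b1, W2, b2):
    """Eq. (3): MLP with tanh activation followed by softmax over 2 classes."""
    return softmax(W2 @ np.tanh(W1 @ c + b1) + b2)

def cross_entropy(p, y):
    """Per-instance term of the cross-entropy loss in Eq. (4)."""
    return -np.log(p[y])

# Toy example: 6 hidden states of size 4; the target word spans positions 2-3.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
c = target_feature(H, 2, 3)
p = predict_metaphoricity(c, rng.normal(size=(8, 4)), np.zeros(8),
                          rng.normal(size=(2, 8)), np.zeros(2))
```

In a real model the head would be trained jointly with the encoder; here the random weights only demonstrate the shapes and the flow of Eqs. (2)-(4).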

Contrastive Objective
The Metaphor Identification Procedure (MIP) dictates that a word is identified as a metaphor if its literal meaning contrasts with the meaning it takes in context (Pragglejaz Group, 2007). According to MIP, the contrast between the contextual and literal meaning of a word serves as an important criterion for detecting its metaphoricity. Although some work (Mao et al., 2019; Choi et al., 2021) has attempted to explore the contrastive relationship between the literal and contextual meanings of a target word by simply concatenating semantic features extracted from different branches of a model, it remains unclear whether this contrastive relationship is effectively modeled. In this section, we explicitly incorporate a contrastive objective to capture this relationship, making the two senses more distinguishable for the classifier. The objective pulls the metaphorical instances of a target word toward closer semantic representations while keeping literal instances apart. As shown in the shaded green part of Figure 2, the target word "digest" in both instances a and b is metaphorical and means "arrange and integrate some information in the mind," rather than its literal meaning "converting food into absorbable substances" in instance c. Therefore, we expect the contextual representations of "digest" in sentences a and b to be more similar to each other and far away from the representation in sentence c.
Formally, given a sentence S_a with target word w_t as an anchor, S_p is a positive example whose target word w_t belongs to the same class as S_a in batch B, while S_n is a negative example whose target word w_t belongs to the other class in batch B. We calculate their contextualized features c_a, c_p, and c_n by Eq. (2), respectively. The contrastive objective is defined as:

L_co = [d(c_a, c_p) - d(c_a, c_n) + γ]_+  (5)

where [·]_+ denotes the function f(x) = max(0, x), d(·, ·) denotes the L2-normalized Euclidean distance, and γ controls the margin.
This loss captures similarities between examples of the same class and contrasts them with examples from the other class. When the samples are from different classes (that is, one is metaphorical and the other is literal), the contrastive loss increases the distance between them and keeps them apart by at least the margin γ. Modeling the distance in embedding space between a target word's literal and metaphorical senses is an important characteristic for metaphor detection.
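A minimal sketch of this triplet-style objective (our own illustrative NumPy code under the definitions above, not the authors' implementation):

```python
import numpy as np

def l2_normalized_distance(u, v):
    """Euclidean distance between L2-normalized vectors, as used in Eq. (5)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.linalg.norm(u - v)

def contrastive_loss(c_a, c_p, c_n, gamma=1.0):
    """Eq. (5): pull the anchor toward the positive and push it away from
    the negative by at least the margin gamma (hinge at zero)."""
    d = l2_normalized_distance
    return max(0.0, d(c_a, c_p) - d(c_a, c_n) + gamma)

# Toy features: "digest" used metaphorically in two sentences (anchor a,
# positive p) and literally in a third (negative n).
c_a = np.array([1.0, 0.0])
c_p = np.array([0.9, 0.1])
c_n = np.array([0.0, 1.0])
loss = contrastive_loss(c_a, c_p, c_n)  # negative is already far: loss is 0
```

Swapping the positive and negative (treating the literal instance as the positive) yields a large loss, which is exactly the signal that pushes the two senses apart during training.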

Semi-supervised Learning
The scarcity of labeled data is another challenge for MD. Currently, only relatively small training sets are available, and labeling metaphorical words requires manual effort from metaphor experts, which is time-consuming and labor-intensive. Although recent advances in pre-trained models reduce the annotation workload, these models still require large amounts of labeled data to avoid overfitting (Du et al., 2020). In this section, we propose a simple strategy called the Target-based Generating Strategy (TGS) to construct a large-scale training dataset without the need for metaphor experts or sophisticated pre-defined rules.
Target-based Generating Strategy (TGS) TGS is based on a simple heuristic: if a word serves as the detection target in a sentence, all other sentences containing this word in a specific corpus serve as potential candidate instances. Using the target words in the labeled data as heuristic seeds, this strategy efficiently obtains a large-scale candidate set U that covers many topics without any special manual design. It would be natural to use the fine-tuned model to predict the labels of candidate instances and then select high-confidence samples as expanded data, but this approach depends on the performance of the pre-trained model, which may lead to prediction bias and introduce noise.
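The retrieval step of TGS can be sketched as follows. This is a deliberately simplified illustration using whitespace tokenization; the paper itself splits a Wikipedia dump into sentences with NLTK.

```python
def collect_candidates(target_words, corpus_sentences):
    """For each target word seen in the labeled data (the heuristic seeds),
    retrieve every corpus sentence containing it as an unlabeled candidate."""
    targets = set(w.lower() for w in target_words)
    candidates = []
    for sent in corpus_sentences:
        tokens = sent.lower().rstrip(".").split()
        for w in targets.intersection(tokens):
            candidates.append((sent, w))  # (candidate sentence, target word)
    return candidates

corpus = [
    "She digested the report overnight.",
    "Enzymes digest proteins in the stomach.",
    "The committee attacked the proposal.",
]
cands = collect_candidates(["digested", "attacked"], corpus)
```

Note that the candidates carry no metaphoricity labels; labels are supplied later as soft pseudo-labels during self-training.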
Self-Training (ST) To alleviate the noise in U, we adopt self-training (Rosenberg et al., 2005; Lee et al., 2013) to generate pseudo-labels for the candidate instances with the fine-tuned model and incorporate them into the training set; the pseudo-labels and the model are then updated in an iterative manner. There are two alternatives for generating pseudo-labels for candidate instances, namely hard labeling (Lee et al., 2013) and soft labeling (Xie et al., 2016). Hard labeling selects the highest-confidence prediction as the class label for each instance, which is prone to error propagation when the prediction is wrong (Yu et al., 2020). Alternatively, we choose to generate soft pseudo-labels ŷ_i ∈ R^K for each instance u_i ∈ U:

ŷ_ij = (p_ij^2 / f_j) / Σ_{j'} (p_ij'^2 / f_j')  (6)

where f_j = Σ_i p_ij is the sum over the soft frequencies of class j and p_ij is the j-th class prediction for u_i. Eq. (6) derives ŷ_i by strengthening high-confidence predictions while reducing low-confidence ones via squaring and normalizing the current predictions, and it retains more information than hard labels. We define the ST objective as a KL-divergence loss between the pseudo-label distributions Ŷ and the model's current predictions P:

L_KL = KL(Ŷ ‖ P) = Σ_i Σ_j ŷ_ij log(ŷ_ij / p_ij)  (7)
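Under our reading of Eqs. (6)-(7), which follow the soft-labeling scheme of Xie et al. (2016), the computation looks like this in NumPy (an illustrative sketch, not the authors' code):

```python
import numpy as np

def soft_pseudo_labels(P):
    """Eq. (6): sharpen current predictions P (N x K rows summing to 1)
    into soft targets by squaring and normalizing with the soft class
    frequencies f_j = sum_i p_ij."""
    f = P.sum(axis=0)
    Q = (P ** 2) / f
    return Q / Q.sum(axis=1, keepdims=True)

def kl_loss(Y_hat, P):
    """Eq. (7): KL divergence between pseudo-labels and current predictions."""
    return float(np.sum(Y_hat * np.log(Y_hat / P)))

# Three unlabeled instances, two classes (literal / metaphorical).
P = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8]])
Y_hat = soft_pseudo_labels(P)  # confident rows become sharper targets
```

Fitting the model toward `Y_hat` nudges it to be more confident where it already leans strongly, while the frequency normalization by f_j discourages collapsing onto the majority class.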

Training Procedure of CATE
The overall objective function of CATE combines the contrastive loss L_co and the classification loss L_cls on labeled data with the KL loss on unlabeled data U:

L = L_cls + α L_co + β L_KL  (8)

where α and β are hyperparameters balancing the strength of the contrastive loss and the KL loss, respectively. CATE follows a two-stage training procedure. In the first stage, we fine-tune the pre-trained model with the first two terms of Eq. (8) on the labeled data, which learns the contrastive relationship in metaphors and improves the quality of predictions for MD. We then use the fine-tuned model to predict soft pseudo-labels for all unlabeled data collected by TGS. In the second stage, we apply self-training to augment the training data with pseudo-labeled data and update the pre-trained model in an iterative manner: we repeatedly compute soft pseudo-labels from the current predictions and refine the model parameters with Eq. (8). The procedure is summarized in Algorithm 1.
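The combined objective is a simple weighted sum; with the hyperparameter values reported later in the paper (α = 0.2, β = 0.05) it reads:

```python
def total_loss(l_cls, l_co, l_kl, alpha=0.2, beta=0.05):
    """Eq. (8): overall CATE objective combining classification,
    contrastive, and KL self-training terms."""
    return l_cls + alpha * l_co + beta * l_kl

# In stage I only the first two terms are active (no unlabeled data yet).
stage1 = total_loss(l_cls=1.0, l_co=0.5, l_kl=0.0)
stage2 = total_loss(l_cls=1.0, l_co=0.5, l_kl=0.2)
```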

Experimental Setup
Datasets To evaluate the effectiveness of our model, we conduct experiments on three widely studied datasets: (1) VUA (Steen, 2010) is currently the largest publicly available dataset, used in the NAACL-2018 Metaphor Shared Task. Following previous work (Gao et al., 2018; Mao et al., 2019), we examine our model on two tracks, i.e., VUA ALL POS and VUA VERB metaphor detection. (2) MOH-X (Mohammad et al., 2016) is a verb metaphor detection dataset in which only a single target verb is labeled in each sentence; the sentences are sampled from WordNet. (3) TroFi (Birke and Sarkar, 2006) is also a verb metaphor detection dataset, with sentences extracted from the 1987-89 Wall Street Journal Corpus Release 1. Statistics of these datasets are listed in Table 1.
Baselines We compare CATE against state-of-the-art baselines in metaphor detection, including RNN_CLS (Gao et al., 2018): a classification model combining an attention-based BiLSTM with ELMo embeddings. RNN_SEQ_ELMo and RNN_SEQ_BERT (Gao et al., 2018): sequence labeling models with an attention-based BiLSTM combined with ELMo and BERT embeddings, respectively. RNN_HG and RNN_MHCA (Mao et al., 2019): BiLSTM models that incorporate two linguistic theories on top of the structure of Gao et al. (2018). DeepMet (Su et al., 2020): a RoBERTa-based model that frames metaphor detection as a reading comprehension problem. MelBERT (Choi et al., 2021): utilizes RoBERTa as the backbone and models the contextual and literal meanings of target words with a Siamese architecture.

Implementation Details In our experiments, we first collect the set of target words across all datasets as triggers and use TGS to retrieve large-scale target-related candidate instances from a common corpus for semi-supervised learning. We use Wikipedia as the knowledge base because it covers a wide variety of domains, which makes it an ideal general-purpose corpus, and it is easily and cheaply accessible. We extract and filter text from the English Wikipedia dump† to construct a large-scale candidate set, applying the NLTK package (Bird et al., 2009) to split documents into sentences and to perform deduplication. In addition, we filter out sentences longer than 150 words due to potential noise and memory limitations.
Following Su et al. (2020) and Choi et al. (2021), we use RoBERTa (Liu et al., 2019) as the backbone encoder. The number of transformer layers is 12, and the hidden size is 768. We use the AdamW optimizer (Peters et al., 2019) with a learning rate of 3e-5 to update the parameters. The number of training epochs is 5, and the batch size is 32. The margin γ in the contrastive loss is set to 1.0. The hyperparameters α and β are set to 0.2 and 0.05, respectively. We perform 10-fold cross-validation on MOH-X and TroFi and split the VUA datasets into training, validation, and test sets in the same way as previous work (Gao et al., 2018; Mao et al., 2019) for fair comparison.

† https://dumps.wikimedia.org/enwiki/20210201/

Overall Results
We report the results in Table 2 in terms of accuracy, precision, recall, and F1-score, where F1-score is the main measurement for metaphor detection (Mao et al., 2019). We find that CATE achieves strong performance on all datasets: it is superior to existing models on 3 out of 4 datasets in terms of F1-score (improvements of 0.5%, 4.5%, and 1.3% over the previous best model on VUA ALL POS, MOH-X, and TroFi, respectively) and performs comparably to MelBERT on VUA VERB. Notably, DeepMet and MelBERT additionally utilize linguistic features, such as POS features, in their models, while CATE does not use any linguistic features. Meanwhile, the improvement of our model is more pronounced on the small-scale datasets (i.e., MOH-X and TroFi). The reason is that the massive number of parameters in the pre-trained model easily leads to overfitting when only relatively small training sets are available, whereas CATE can make full use of the large amount of unlabeled data collected by the proposed target-based generating strategy and improve generalization via self-training. Compared with RNN_HG, which also follows the MIP principle, our model performs significantly better because it explicitly captures the contrast between the literal and metaphorical meanings of target words with a contrastive objective. Not surprisingly, the approaches based on pre-trained language models (e.g., CATE, MelBERT, DeepMet) are consistently superior to the RNN-based models (e.g., RNN_CLS, RNN_HG, RNN_MHCA) due to the strong expressive power of pre-trained models to encode rich semantic and contextual information into the representations.

Ablation Study As shown in the ablation results in Table 2, each component is important for the proposed model, as excluding any of them hurts performance significantly.
When self-training is removed, the F1-score drops by 2.0% and 0.9% on the small-scale MOH-X and TroFi datasets, respectively, which demonstrates the necessity of integrating semi-supervised learning to improve generalization. The contrastive objective, which learns the difference between a target word's literal and metaphorical senses, is also beneficial to our model.

Model Analysis
VUA Breakdown Analysis Tables 3 and 4 report the breakdown of performance by genre and by open-class words, respectively, on the VUA ALL POS test dataset, in line with Leong et al. (2018) and Mao et al. (2019). CATE shows very promising results against other competitive baselines in both breakdowns. In Table 3, all models achieve better results on Academic because the expressions used in academic articles are formal and normative, with abundant context. In particular, CATE presents a substantial improvement in F1-score over the second-best model, with gains of 1.3% and 0.5% on Conversation and Fiction, respectively. This is meaningful because Conversation and Fiction are more challenging and have lower F1-scores than other genres due to their fragmented or rare expressions, such as er, yeah, na. We speculate that CATE improves here because the target-based generating strategy automatically constructs diverse training data from Wikipedia covering different topics, preventing the model from becoming biased toward a specific domain.
In Table 4, all models perform better on Verb, as it has the largest number of training instances. Notably, CATE provides strong performance on almost all open-class words and achieves large improvements over MelBERT on Verb (0.6%), Adjective (1.4%), and Adverb (0.9%) in terms of F1-score.

Embedding Visualization In Figure 3, we visualize the contextual embeddings from Eq. (2) of TroFi samples for the specific target words attack and cool. As shown in Figure 3 (a)(c), when the contrastive objective is removed, the literal and metaphorical representations are mixed together and indistinguishable. Based on the MIP principle, a metaphor is identified if the literal meaning of the target word contrasts with its contextual meaning. As expected, the proposed contrastive objective explicitly extends the distance between the target word's literal and metaphorical meanings in the embedding space and learns more compact representations for data from the same class, as shown in Figure 3 (b)(d).
Impact of Available Labeled Data We further investigate the effectiveness of self-training under different ratios of supervised data. The results on MOH-X and TroFi are reported in Figure 4. As the amount of labeled data increases, the gain from self-training gradually decreases. When little supervised data is available, self-training can be regarded as a regularizer that effectively improves the prediction ability and generalization of the model.

Hyperparameter Discussion We examine the effects of the hyperparameters α and β on the MOH-X dataset, as shown in Figure 5. As α or β increases, model performance first increases and then decreases. When α is too large, the model over-penalizes the distance and overlooks the metaphorical associations between different senses; when β is too large, performance also deteriorates because too much noise is injected from the unlabeled data.

Conclusion
This paper takes advantage of self-training and designs a simple but effective metaphor detection model based on a pre-trained backbone that captures contextualized features. Specifically, we incorporate a contrastive objective into the model to capture the semantic incongruence in metaphors and use a simple strategy to automatically construct substantial training data for self-training. Evaluation on multiple benchmarks shows that our model achieves state-of-the-art performance. In future work, we plan to explore how to use unlabeled data more effectively and to discover potentially valuable metaphor examples to reduce manual annotation effort.