Adversarial Multi-task Learning for End-to-end Metaphor Detection

Metaphor detection (MD) suffers from limited training data. In this paper, we start with a linguistic rule called the Metaphor Identification Procedure and propose a novel multi-task learning framework to transfer knowledge from basic sense discrimination (BSD) to MD. BSD is constructed from word sense disambiguation (WSD), which has copious amounts of data. We leverage adversarial training to align the data distributions of MD and BSD in the same feature space, so that task-invariant representations can be learned. To capture fine-grained alignment patterns, we utilize the multi-mode structures of the MD and BSD data. Our method is fully end-to-end and mitigates the data scarcity problem in MD. Competitive results are reported on four public datasets. Our code and datasets are available.


Introduction
Metaphor involves a mapping mechanism from a source domain to a target domain, as proposed in Conceptual Metaphor Theory (Lakoff and Johnson, 2008), e.g., "The police smashed the drug ring after they were tipped off."
Smash in the above sentence literally means "hit hard" (source domain). However, it is employed here in a creative way, indicating "overthrow or destroy" (target domain). The mapping from the source to the target makes the word a metaphor.
Understanding metaphors in human languages is essential for a machine to uncover the underlying intents of speakers. Thus, metaphor detection and understanding are crucial for sentiment analysis (Cambria et al., 2017), machine translation (Mao et al., 2018), etc.
Metaphor detection (MD) requires a model to predict whether a specific word is literal or metaphorical in its current context. Linguistically, if there is a semantic conflict between the contextual meaning and a more basic meaning, the word is a metaphor (Crisp et al., 2007; Steen, 2010; Do Dinh et al., 2018). The advent of large pre-trained language models has pushed the boundaries of MD far ahead (Devlin et al., 2019; Liu et al., 2019b). However, MD suffers from limited training data, because data annotation requires complex and difficult expert knowledge (Group, 2007).
Recently, Lin et al. (2021) used self-training to expand the MD corpus, but error accumulation can be a problem. Many researchers have used various kinds of external knowledge, such as part-of-speech tags (Su et al., 2020; Choi et al., 2021), dictionary resources (Su et al., 2021; Zhang and Liu, 2022), and dependency parsing (Le et al., 2020; Song et al., 2021), to improve MD performance. These methods are not end-to-end, which impedes continuous training on new data.
To address the data scarcity problem in MD, we propose a novel task called basic sense discrimination (BSD), derived from word sense disambiguation (WSD). BSD regards the most commonly used lexical sense as the basic usage, and aims to identify whether a word is used in its basic sense in a given context. Both BSD and MD need to compare the semantic difference between the basic meaning and the current contextual meaning. Despite the lack of MD data, we can distill knowledge from BSD to alleviate data scarcity and overfitting, which motivates the use of multi-task learning.
We design the Adversarial Multi-task Learning Framework (AdMul) to facilitate knowledge transfer from BSD to MD. AdMul aligns the data distributions of MD and BSD through adversarial training, forcing the shared encoding layers (e.g., BERT) to learn task-invariant representations. Furthermore, we leverage the internal multi-mode structures for fine-grained alignment. The literal senses in MD are aligned with basic senses in BSD, which pushes the literal senses away from the metaphorical ones. Similarly, the non-basic senses in BSD are aligned with metaphors in MD, which enlarges the discrepancy between basic and non-basic senses and enhances model performance.
The contributions of this paper can be summarized as follows: • We propose a new task, basic sense discrimination, to improve metaphor detection via multi-task learning. The data scarcity problem in MD can be mitigated via knowledge transfer from a related task.
• Our proposed model, AdMul, uses adversarial training to learn task-invariant representations for metaphor detection and basic sense discrimination. We also make use of multi-mode structures for fine-grained alignment.
Our model requires no external resources, is fully end-to-end, and is easy to train.
• Experimental results indicate that our model achieves competitive performance on four datasets thanks to knowledge transfer and the regularization effect of multi-task learning. Our zero-shot transfer result even surpasses fine-tuned baseline models.

Related Work
Metaphor Detection: Metaphor detection is a popular task in figurative language computing (Leong et al., 2018, 2020). With the progress of natural language processing, various methods have been proposed. Traditional approaches used linguistic features such as word abstractness, word concreteness, part-of-speech tags, and linguistic norms to detect metaphors (Shutova and Sun, 2013; Tsvetkov et al., 2014; Beigman Klebanov et al., 2018; Wan et al., 2020). These methods are not end-to-end and rely heavily on feature engineering.
The rise of deep learning boosted the advancement of metaphor detection significantly. Gao et al. (2018), Wu et al. (2018), and Mao et al. (2019) used RNNs and word embeddings to train MD models. Recently, many works combined the advantages of pre-trained language models and external resources to enhance metaphor detection (Su et al., 2020, 2021; Choi et al., 2021; Song et al., 2021; Zhang and Liu, 2022). Though great improvements have been made, these models still suffer from the lack of training data, as exemplified by their poorer performance on small datasets.

Multi-task Learning: Multi-task learning (MTL) can benefit a target task via related tasks. It has brought great success in computer vision and natural language processing. MTL learns universal representations for different task inputs, so all tasks share a common feature space in which knowledge transfer becomes possible. Previous studies trained MTL models with deep neural networks such as CNNs or RNNs, achieving promising results in text classification (Liu et al., 2017; Chen and Cardie, 2018). Liu et al. (2019a) and Clark et al. (2019) combined the MTL framework with BERT (Devlin et al., 2019), obtaining encouraging results on multiple GLUE tasks. There are other successful MTL applications in machine translation (Dong et al., 2015), information extraction (Nishida et al., 2019), sentiment analysis (Liang et al., 2020), etc. Dankers et al. (2019) applied MTL to study the interplay of metaphor and emotion. Le et al. (2020) combined WSD and MD for better metaphor detection results. However, to the best of our knowledge, we are the first to use adversarial MTL for metaphor detection based on the linguistic nature of metaphors.
Proposed Method

Metaphor Identification Procedure
The Metaphor Identification Procedure (MIP) (Crisp et al., 2007) is the most commonly used linguistic rule for guiding metaphor detection. It was originally the construction guideline of the VU Amsterdam Metaphor Corpus. MIP states that if a word contrasts with one of its more basic meanings but can be understood in comparison with it, then the word is a metaphor. A more basic meaning is more concrete, body-related, more precise, or historically older (Steen, 2010; Do Dinh et al., 2018).
Some researchers have pointed out that when a word is used alone, it is very likely to express a more basic meaning (Choi et al., 2021; Song et al., 2021). We therefore concatenate the target word and the sentence as input. The first segment is the target used alone, representing a more basic meaning. The second segment is the whole sentence, which encodes the contextual meaning of the target. The model then applies MIP to detect metaphors.
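Concretely, the paired input follows the standard BERT segment-pair layout. The plain-string sketch below only illustrates that layout; in practice the pretrained model's tokenizer inserts the special tokens and segment IDs itself:

```python
def build_mip_input(target: str, sentence: str) -> str:
    """Pair the isolated target word (segment A, standing in for a more
    basic meaning) with the full sentence (segment B, carrying the
    contextual meaning), following the BERT segment-pair convention."""
    return f"[CLS] {target} [SEP] {sentence} [SEP]"

pair = build_mip_input(
    "smashed",
    "The police smashed the drug ring after they were tipped off.",
)
```

The model then compares the representations of the two occurrences of the target, which is exactly the contrast MIP prescribes.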

From WSD to BSD
Metaphor detection (MD) aims to identify whether a contextualized word is metaphorical. Word sense disambiguation (WSD) aims to determine the lexical sense of a word from a given sense inventory. The two tasks share the same nature: deciding the sense of a given word according to its context. A word may have multiple senses, so WSD is a multinomial classification task, whereas MD is a binary classification task. Integrating WSD with MD can be quite expensive. For example, the state-of-the-art WSD model (Barba et al., 2021) regards WSD as an information extraction task: it concatenates all the candidate senses and tries to extract the correct one. Such a method requires not only external dictionary resources but also enormous computing resources, since the input may be a very long sequence.
WordNet (Miller, 1995; Fellbaum, 1998) ranks the senses of a word by occurrence frequency (see https://wordnet.princeton.edu/frequently-asked-questions). The most commonly used lexical sense sits at the top of the inventory list, and it is usually a more basic meaning (Choi et al., 2021; Song et al., 2021; Zhang and Liu, 2022). Thus, we regard the most commonly used sense as the basic sense of a word, and try to determine whether a word in a given context is used in its basic sense. We call this task basic sense discrimination (BSD). BSD is a binary classification task and therefore fits MD.
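The labeling rule can be sketched as follows. The sense keys and ranks below are hypothetical stand-ins for WordNet sense entries; in practice the ranks come from WordNet and the annotated occurrences from a WSD corpus such as SemCor:

```python
# Toy sense inventory: (lemma, sense_key) -> frequency rank (1 = most common).
# Keys are hypothetical; real ones come from WordNet.
SENSE_RANK = {
    ("dream", "dream.v.01"): 1,  # "have a daydream; indulge in a fantasy"
    ("dream", "dream.v.02"): 2,  # "experience while sleeping"
}

def bsd_label(lemma: str, sense_key: str) -> int:
    """BSD label for an annotated occurrence: 0 if the sense is the
    most frequent (rank-1, i.e. basic) sense of the lemma, else 1
    (non-basic) -- matching the paper's class convention
    (class 0: literal/basic, class 1: metaphorical/non-basic)."""
    return 0 if SENSE_RANK.get((lemma, sense_key)) == 1 else 1
```

As the Limitations section notes, the rank-1 heuristic can misfire: for dream, the most frequent sense is itself metaphorical, so this labeling is an approximation rather than ground truth.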

Task Description
Formally, we are given the MD dataset D_MD = {(x_i, y_i)} drawn from distribution p_MD and the BSD dataset D_BSD = {(x_i, y_i)} drawn from p_BSD. Both D_MD and D_BSD are used to train a multi-task learning model, which aligns p_MD and p_BSD in the same feature space via adversarial training. Our goal is to minimize the risk ϵ = E_(x,y)∼p_MD [f(x) ≠ y]. We use BSD only as an auxiliary task and care only about the performance of MD.

Model Details
We present AdMul to tackle MD and BSD simultaneously. As Fig. 1 shows, AdMul has five key parts: the shared feature extractor Q_f (BERT in our case, the green part), the task-specific classifiers Q_y (the purple part), the gradient reversal layer Q_λ (the grey part), the global task discriminator Q_d^g (the red part), and the local task discriminators Q_d^lc (the yellow part).

Feature Extractor
AdMul adopts BERT as the feature extractor Q_f, which is shared by both MD and BSD. We take the BERT hidden state of [CLS] as a semantic summary of the input segment pair (Devlin et al., 2019).
[CLS] can automatically attend to the positions of the two target words in the two segments, and perceive their semantic difference via the self-attention mechanism (Vaswani et al., 2017). The hidden state then goes through a non-linear activation function to produce the semantic discrepancy feature v (Eq. 1). In addition, we use the whole input sequence x to generate the sentence embedding h via average pooling (Eq. 2).
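As a library-agnostic sketch, assume a (seq_len × dim) matrix of BERT hidden states with [CLS] at position 0 and a padding mask. The choice of tanh as the non-linearity is an assumption of this sketch, since the text only specifies "a non-linear activation function":

```python
import numpy as np

def discrepancy_and_sentence_features(hidden, mask):
    """hidden: (seq_len, dim) encoder hidden states, [CLS] at index 0.
    mask: (seq_len,) with 1.0 for real tokens and 0.0 for padding.
    Returns the semantic discrepancy feature v and the average-pooled
    sentence embedding h."""
    v = np.tanh(hidden[0])                             # Eq. 1 (tanh assumed)
    h = (hidden * mask[:, None]).sum(0) / mask.sum()   # Eq. 2: masked average pooling
    return v, h

rng = np.random.default_rng(0)
hid = rng.normal(size=(6, 8))                  # toy: 6 tokens, dim 8
msk = np.array([1, 1, 1, 1, 0, 0], dtype=float)
v, h = discrepancy_and_sentence_features(hid, msk)
```

Masking before pooling keeps padding tokens from diluting the sentence embedding.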

Task-specific Classifier
The task-specific classifier Q_y takes the semantic discrepancy feature v as input. For brevity, we draw only a single classifier in the diagram; in fact, we use different classifiers for MD and BSD.
ŷ = softmax(W_Qy v + b_Qy),   (3)

where ŷ ∈ R^2 is the predicted label distribution of x, and W_Qy and b_Qy are the weights and bias of Q_y. Finally, we can compute the classification losses

L_y^MD = (1/N_MD) Σ_i L_CE(ŷ_i, y_i),   (4)

L_y^BSD = (1/N_BSD) Σ_i L_CE(ŷ_i, y_i),   (5)

where L_CE is the cross-entropy loss function, and ŷ_i and y_i are the predicted probability distribution and the ground-truth label of the i-th training sample, respectively.
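A minimal numerical sketch of this head and its loss (toy zero weights here; the real W_Qy and b_Qy are learned parameters):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

def classify(v, W, b):
    """Task-specific head: y_hat = softmax(W v + b), y_hat in R^2."""
    return softmax(W @ v + b)

def cross_entropy(y_hat, y):
    """y: ground-truth class index (0 literal/basic, 1 metaphorical/non-basic)."""
    return -np.log(y_hat[y])

W = np.zeros((2, 8))
b = np.zeros(2)
y_hat = classify(np.ones(8), W, b)   # uninformative weights -> uniform output
```

With zero weights the head outputs the uniform distribution, so the cross-entropy equals log 2 regardless of the label — a handy sanity check when wiring up the two task heads.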

Gradient Reversal Layer
The Gradient Reversal Layer (GRL) Q_λ is the key component of adversarial learning (Ganin and Lempitsky, 2015). During forward propagation, the GRL acts as an identity function, while during back propagation it multiplies the gradient by a negative scalar −λ, i.e., it reverses the gradient. The operations can be formulated as the following pseudo-function:

Q_λ(x) = x,   (6)

∂Q_λ(x)/∂x = −λI,   (7)

where I is an identity matrix and λ is computed automatically (see Section 4.3).
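The two passes can be sketched numerically as below (in a deep learning framework this is typically written as a custom autograd function; the plain functions here are just an illustration):

```python
def grl_forward(x):
    """Forward pass: identity."""
    return x

def grl_backward(upstream_grad, lam):
    """Backward pass: flip the gradient's sign and scale it by lambda."""
    return -lam * upstream_grad

# The discriminator's gradient points toward *lower* discrimination loss.
# After the GRL flips it, the feature extractor is pushed in the opposite
# direction -- toward features the discriminator cannot tell apart.
g = grl_backward(upstream_grad=2.0, lam=0.5)
```

Because the flip happens only in the backward pass, a single loss can be minimized end-to-end while the extractor and discriminators still play adversarial roles.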

Global Discriminator
The sentence embedding h = Q_f(x) first goes through the GRL, and then the global discriminator Q_d^g tries to predict which task h belongs to. The training objective of Q_d^g is

L_d^g = −(1/|D|) Σ_{x_i ∈ D} [ d_i log Q_d^g(Q_λ(h_i)) + (1 − d_i) log(1 − Q_d^g(Q_λ(h_i))) ],   (8)

where D = D_MD ∪ D_BSD, and d_i is the task label of input x_i (d = 0 for MD and d = 1 for BSD).
The feature extractor Q_f tries to generate similar features to fool the global task discriminator Q_d^g, so that Q_d^g cannot accurately discern the source task of an input feature. Features that cannot be used to distinguish the source are task-invariant (Liu et al., 2017; Chen and Cardie, 2018). As the model converges, Q_f learns universal representations that align the distributions of MD and BSD.

Local Discriminator
We observe corresponding patterns between MD and BSD via simple linguistic analysis, as Fig. 2 shows. The samples in MD can be classified as literal or metaphorical, while the samples in BSD can be categorized as basic or non-basic. A basic sense (red samples) must be literal (green samples), so they cluster closer in the feature space. A metaphor (yellow samples) must be non-basic (blue samples), hence they are closer. Moreover, the metaphorical and the basic are significantly dissimilar, so they lie at different corners of the feature space, far from each other. If we bring the literal and the basic closer, the dividing line between the metaphorical and the literal becomes clearer. If the metaphorical and the non-basic get closer, BSD is promoted as well, and better BSD performance in turn strengthens knowledge transfer from BSD to MD.
Such multi-mode patterns inspire us to apply fine-grained alignment (Pei et al., 2018; Yu et al., 2019). We push the class-0 samples (literal in MD and basic in BSD) closer together, and likewise cluster the class-1 samples (metaphorical in MD and non-basic in BSD). Therefore, we use two local discriminators, each aligning the samples of class c ∈ {0, 1}:

L_d^l = (1/|D|) Σ_{c=1}^{C} Σ_{x_i ∈ D} w_{d_i} L_CE(Q_d^{lc}(ŷ_i^c Q_λ(h_i)), d_i),   (9)

where d_i is the task label (d = 0 for MD and d = 1 for BSD), C is the number of classes, and w_d is a task weight. To maintain the dominance of MD in local alignment, we set w_0 = 1 and w_1 = 0.3 in all experiments. ŷ_i^c comes from Eq. 3. The classifier Q_y delivers a normalized label distribution for each sample x_i, no matter which task it belongs to, and we can view this distribution as an attention mechanism: Q_y believes x_i belongs to class c with probability ŷ_i^c, so we apply the label distribution to the sample as attention weights. In practice, this soft attention performs better than hard attention, because more information is taken into account.
The training of the local discriminators is also adversarial. The feature extractor Q_f generates task-invariant features to fool the local discriminators Q_d^lc, so that Q_d^lc cannot discern which task the features of class c come from.
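The class-wise, attention-weighted local discrimination loss can be sketched as follows. The binary log-loss per discriminator and the normalization by the total weight are assumptions of this sketch, not details given in the text:

```python
import math

TASK_WEIGHT = {0: 1.0, 1: 0.3}  # w_0 (MD) and w_1 (BSD), values from the paper

def local_discriminator_loss(samples):
    """samples: list of (y_hat, d, p_bsd) where
       y_hat -- [p(class 0), p(class 1)] from the task classifier Q_y,
                used as soft attention weights;
       d     -- task label (0 = MD, 1 = BSD);
       p_bsd -- [p_0, p_1]: each local discriminator's predicted
                probability that the sample comes from BSD."""
    total, weight_sum = 0.0, 0.0
    for y_hat, d, p_bsd in samples:
        for c in (0, 1):                     # one local discriminator per class
            ce = -math.log(p_bsd[c] if d == 1 else 1.0 - p_bsd[c])
            w = TASK_WEIGHT[d] * y_hat[c]    # task weight x soft attention
            total += w * ce
            weight_sum += w
    return total / weight_sum

# One MD sample the classifier is sure is literal; both local
# discriminators are maximally uncertain (p = 0.5).
loss = local_discriminator_loss([([1.0, 0.0], 0, [0.5, 0.5])])
```

Soft weighting means every sample contributes to both class-aligners in proportion to its predicted class membership, rather than being routed to exactly one of them.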

Training Objective
The training of AdMul involves multiple objectives. It can be formulated as the loss function below:

L(θ_f, θ_y, θ_d) = L_y^MD + α L_y^BSD − β L_d^g − γ L_d^l,   (10)

where α, β, and γ are hyper-parameters that balance the loss magnitudes, and θ_f, θ_d, and θ_y are the parameters of Q_f, Q_d (all discriminators), and Q_y, respectively.
The optimization of L involves a mini-max game like that of the Generative Adversarial Network (Goodfellow et al., 2014). The feature extractor Q_f tries to make the deep features as similar as possible, so that neither the global nor the local task discriminators can differentiate which task they come from. After training converges, the parameters θ̂_f, θ̂_y, and θ̂_d deliver a saddle point of Eq. 10:

(θ̂_f, θ̂_y) = arg min_{θ_f, θ_y} L(θ_f, θ_y, θ̂_d),   (11)

(θ̂_d) = arg max_{θ_d} L(θ̂_f, θ̂_y, θ_d).   (12)

At the saddle point, θ_y minimizes the classification loss L_y (combining L_y^MD and L_y^BSD), θ_d minimizes the task discrimination loss L_d (combining L_d^g and L_d^l), and θ_f maximizes the loss of the task discriminators (the features become task-invariant, so the task discrimination loss increases). AdMul can be easily trained via standard gradient descent algorithms. Taking stochastic gradient descent (SGD) as an example,

θ_f ← θ_f − η ( ∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f ),   (13)

where i denotes the i-th training sample and η is the learning rate. The updates for θ_y and θ_d are the same as in standard SGD. As for θ_f, if there were no minus sign before ∂L_d^i/∂θ_f, SGD would minimize the task discrimination loss L_d, which would make the features generated by Q_f dissimilar across tasks (Ganin and Lempitsky, 2015).

Datasets
Four metaphor detection datasets are used in our experiments; their statistics are shown in Table 1. VUA All (Steen, 2010) is the largest metaphor detection dataset to date. It labels each word in a sentence, and the sentences come from four genres: academic, conversation, fiction, and news. VUA Verb (Steen, 2010) is a subset in which the target words are all verbs. We use a word sense disambiguation (WSD) toolkit (Raganato et al., 2017) to create the basic sense discrimination (BSD) dataset. The toolkit provides SemCor (Miller et al., 1994), the largest manually annotated dataset for WSD. We filter out targets that have fewer than 3 senses to balance the sizes of the WSD and MD datasets. The statistics of the BSD dataset are shown in Table 2. Among the baselines, linguistic rules such as Selectional Preference Violation (Wilks, 1975, 1978) guide the model design; MisNet regards MD as semantic matching, with dictionary resources leveraged.

Implementation Details
We use DeBERTa-base as the backbone (the feature extractor Q_f in Fig. 1) for all experiments (He et al., 2021), through the APIs provided by HuggingFace (Wolf et al., 2020). The embedding dimension is 768. We set the maximum input sequence length to 150. The optimizer is AdamW (Peters et al., 2019).
We set α = 0.2, β = 0.1, and γ = 0.1 according to model performance on VUA Verb, and apply them to the remaining datasets. The total number of training epochs, the batch size, and the learning rate are specific to each dataset, as Table 3 shows. Instead of a fixed constant, the parameter λ in the GRL (Eq. 7) is set by λ = m / (1 + exp(−10p)) − n, where m = 1.4 and n = 0.6, and p = t/T, with t and T the current and the maximum training step, respectively. λ thus increases from 0.1 to 0.8 in our case. This schedule stabilizes training (Ganin and Lempitsky, 2015): at the beginning, λ should be small so that the generated features are not too hard for task discrimination; as training proceeds, adversarial training is strengthened for better knowledge transfer. We choose the best model on the validation set for testing. Since MOH-X and TroFi have no training/validation/test split, we use 10-fold cross-validation. In each iteration, we pack MD and BSD samples into one mini-batch, with equal amounts of each (half the batch size). All experiments are run on an RTX 3090 GPU with CUDA 11.6.
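The λ schedule can be checked directly; with m = 1.4 and n = 0.6 it indeed runs from 0.1 at the start of training to roughly 0.8 at the end:

```python
import math

def grl_lambda(t, T, m=1.4, n=0.6):
    """Progress-dependent GRL coefficient:
    lambda = m / (1 + exp(-10 p)) - n, with p = t / T the fraction of
    training completed. m and n are the values used in the paper."""
    p = t / T
    return m / (1.0 + math.exp(-10.0 * p)) - n

start = grl_lambda(0, 100)    # beginning of training
end = grl_lambda(100, 100)    # end of training
```

The sigmoid shape means λ grows quickly in the middle of training and saturates near 0.8, so the adversarial signal ramps up smoothly rather than switching on abruptly.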

Overall Results
To be consistent with previous studies (Mao et al., 2018; Choi et al., 2021; Zhang and Liu, 2022), we mainly focus on the F1 score. As Table 4 shows, our proposed AdMul obtains great improvements over the baseline models. The best scores are reported on 3 out of 4 datasets: VUA Verb, MOH-X, and TroFi. We also attain a result comparable to the state-of-the-art model on VUA All. The average F1 score across the 4 datasets is 79.98, which is 2.33 points higher than MisNet (77.65 on average). We notice that AdMul performs better on the small datasets (VUA Verb, MOH-X, and TroFi) than on the large one (VUA All). We attribute this to dataset size. Deep learning models need large amounts of data to achieve good performance, so MTL can help: the knowledge distilled from BSD can greatly promote MD, especially under severe data scarcity. MTL also works as a regularization method that avoids overfitting by learning task-invariant features (Liu et al., 2019a). However, VUA All is a large dataset, so additional data from a related task may have only marginal utility. Moreover, VUA All requires predictions for every word class, while BSD covers only open-class (i.e., verb, noun, adjective, and adverb) words. Consequently, targets of the remaining word classes cannot receive enough transferred knowledge.
The most significant improvement is on MOH-X. The BSD dataset and MOH-X are both built upon WordNet, so their data distributions are very similar. In such a case, AdMul can easily align them globally and pay more attention to local alignment. The improvement on TroFi is barely satisfactory. TroFi was built via an unsupervised method and may therefore contain considerable noise; as observed, many baseline models perform mediocrely on TroFi.
MUL_GCN is the only multi-task baseline method in our experiments. MUL_GCN used an L2 loss term to force the MD encoder and the WSD encoder to generate similar deep features for MD and WSD data. However, it only leveraged the features at the output layer, without a parameter-sharing strategy. Thus MUL_GCN did not allow latent interaction between the different data distributions, which is why our method performs better.

VUA All Breakdown Results
Table 5 shows a breakdown analysis of the VUA All dataset. The most important part of MD is model performance on open-class words. AdMul achieves the best F1 scores on 3 out of 4 open word classes, and a result similar to MelBERT on adverbs. The biggest gain is on nouns, with a 2.8-point absolute F1 improvement over the strongest baseline, MelBERT. The improvement on adjectives is also encouraging (2.5 points absolute over MisNet). Though AdMul performs slightly worse than MisNet on VUA All overall, it obtains better results on open-class words. As mentioned before, WordNet only has annotated knowledge for open-class words, which demonstrates that AdMul benefits from MTL.

VUA All Genres
The sentences of the VUA All dataset originate from four genres: academic, conversation, fiction, and news. The performance of AdMul on the four genres is shown in Table 6. The performance on conversation is inferior to the others, since conversations contain more closed word classes (e.g., conjunctions, interjections, prepositions). The performance on academic text is the best, since it has more open-class words, which are well covered in WordNet. The VUA All dataset annotates metaphoricity for closed word classes as well, but these cases can be confusing, e.g., "She checks her appearance in a mirror."
The preposition in in the sentence above is tagged as metaphorical. However, it is quite tricky even for humans to notice the metaphorical sense. As Table 7 shows, there are many words in closed classes, but AdMul cannot get transferred knowledge for them from the auxiliary task BSD.

Zero-shot Transfer
We use AdMul trained on VUA All to conduct zero-shot transfer to two small datasets, MOH-X and TroFi. The results are shown in Table 8. Though its performance on VUA All is inferior to MisNet's, AdMul has stronger generalization ability, beating the baseline models on all metrics across the two datasets. It is worth mentioning that DeepMet and MelBERT are trained on an expanded version of VUA All (Choi et al., 2021), so they have more data than we do. Our zero-shot performance on MOH-X even exceeds that of fine-tuned MisNet, the previous state-of-the-art method (see Table 4).

Ablation Study
We carried out ablation experiments to verify the effectiveness of each module, as Table 9 shows. We removed the global discriminator Q_d^g, the local discriminators Q_d^lc, and adversarial training (no discriminators at all), respectively. Each setting hurts the performance of the MTL framework, demonstrating that we cannot naively apply MTL to combine MD and BSD; instead, we should carefully handle the alignment patterns both globally and locally for better knowledge transfer. In addition, we tested DeBERTa-base, a model trained only on the MD dataset. DeBERTa-base takes the target word and its context as input, so it can be viewed as a realization of MIP. Its performance is mediocre, which indicates that the progress of AdMul is due not only to the large pre-trained language model, but also to our adversarial multi-task learning framework.

Hyper-parameter Discussion
In Eq. 10, there are three hyper-parameters, α, β, and γ, which balance the BSD loss, the global alignment loss, and the local alignment loss, respectively.
Here we conduct experiments on the VUA Verb dataset to see the impact of different loss weights. We tune each weight with the rest fixed; the results are shown in Fig. 3. If α is too small, the model cannot get enough transferred knowledge from BSD. On the contrary, if α is too large, BSD dominates the training, hurting MD performance. The two adversarial weights β and γ share the same pattern: if they are too small, the data distributions cannot be aligned well globally or locally, resulting in inadequate knowledge transfer; if they are too large, distribution alignment dominates the training. It is worth noting that training is quite sensitive to γ, because our local alignment is based on a linguistic hypothesis. We should not put too much weight on local alignment, or it will disrupt the correct semantic space and lead to bad results.

Hyper-parameter Search
In this paper, the hyper-parameters are the BSD loss weight α, the global alignment loss weight β, the local alignment loss weight γ, the learning rate η, the batch size, and the total number of training epochs. We tune each hyper-parameter with the rest fixed. α, β, and γ are searched from 0.05 to 0.5 with an interval of 0.05. η is searched per dataset; the chosen values are listed in Table 3. As mentioned before, we tune all hyper-parameters on the VUA Verb dataset and apply them to the remaining datasets, except for η, the batch size, and the total number of training epochs.

Conclusion
In this paper, we proposed AdMul, an adversarial multi-task learning framework for end-to-end metaphor detection. AdMul uses a new task, basic sense discrimination, to promote MD, achieving promising results on several datasets. Its zero-shot results even surpass the previous fine-tuned state-of-the-art method. The ablation study demonstrates that the strength of AdMul comes not only from the pre-trained language model, but also from our adversarial multi-task learning framework.

Acknowledgement
This work is supported by the 2018 National Major Program of Philosophy and Social Science Fund (18ZDA238) and the Tsinghua University Initiative Scientific Research Program (2019THZWJC38).

Limitations
We simply assume that the most commonly used lexical sense is a more basic sense. Though this assumption fits most cases, it may not be accurate all the time. Take the verb dream as an example.
The most commonly used sense of dream according to WordNet is "have a daydream; indulge in a fantasy", which is metaphorical and non-basic.
However, dream has another literal and basic sense, meaning "experience while sleeping". We look forward to a more fine-grained annotation system that clarifies the evolution of different senses: which sense is basic, and how the other senses are derived. Such a system would benefit both metaphor detection and linguistic ontology studies. Due to computing constraints, our model cannot handle long texts. An indirect metaphor needs to be resolved across several sentences; such cases are beyond our capabilities (Zhang and Liu, 2022). We leave this as future work.

Figure 2: Multi-mode structures of MD and BSD data distributions.
Figure 1: AdMul architecture. The black arrows denote forward propagation, while the blue ones denote back propagation. BERT is the shared feature extractor Q_f. GRL stands for gradient reversal layer. The classifier Q_y is task-specific, performing MD or BSD, and y is the label for MD or BSD. The global discriminator Q_d^g aligns the overall data distribution to make BERT learn universal representations. The two local discriminators Q_d^lc are in line with the two labels; each is responsible for aligning the MD and BSD data of label c. Both Q_d^g and Q_d^lc predict which task the input sentence comes from. Task d ∈ {0, 1}, 0 for MD and 1 for BSD. L_y is the loss for Q_y. L_d is the loss for Q_d^g or Q_d^lc.
MOH-X (Mohammad et al., 2016) is sampled from WordNet, with only verb targets included.

Table 3: Hyper-parameters. LR stands for learning rate.

Table 4: MD results on VUA All, VUA Verb, MOH-X, and TroFi (precision, recall, F1, and accuracy for each dataset). The first four baseline models are end-to-end. The best performance for each metric is in bold; the second best is in italic underlined.

Table 5: MD breakdown results on VUA All for open word classes, the most important part of metaphor detection. The first four baseline models are end-to-end.

Table 6: Performance on the four genres of VUA All.

Table 7: Number of words in different word classes in the VUA All dataset.

Table 9: Ablation results on VUA Verb. w/o denotes without.