Nominal Metaphor Generation with Multitask Learning

Metaphor generation is a challenging task that can benefit many downstream applications, such as improving user satisfaction with dialogue systems and enriching story generation. This paper tackles the problem of Chinese nominal metaphor generation by introducing a multi-task metaphor generation framework with self-training and metaphor identification mechanisms. Self-training addresses the data scarcity issue of metaphor datasets: instead of relying solely on labelled metaphor datasets, which are usually small, self-training identifies potential metaphors in a large-scale unlabelled corpus for metaphor generation. The metaphor weighting mechanism enables our model to focus on the metaphor-related parts of the input (e.g., the compared objects and the comparator) during model learning and thus improves the metaphoricity of the generated metaphors. Our model is trained on an annotated corpus of 6.3k sentences containing diverse metaphorical expressions. Experimental results show that our model generates metaphors with better readability and creativity than the baseline models, even when training data is insufficient.


Introduction
Metaphor is commonly used in human language as an effective communication device. Typically, metaphors compare a concept or an object to another with the intent to make the expression more vivid, or to make unfamiliar things easier to understand (Paul, 1970).
According to linguistic studies of Chinese, metaphors are particularly important in Chinese, as metaphor is the dominant form of figurative language in the language (Wang, 2004). As shown in Table 1, nominal metaphors (NMs) in Chinese are figures of speech that associate a noun with another noun through a comparator such as 像, 是, 变成 (equivalent to like, be, become in English). Wang (2004) argues that a nominal metaphor requires the comparison to be drawn between objects different in nature. Therefore, even though the fourth example in the table uses the classic comparator "like", it is not a metaphor, as it compares a person to another person. Verb metaphors are metaphors whose verbs are used metaphorically. The verb metaphor shown in Table 1 uses weaving, a verb usually associated with cloth or looms, to describe human relationships.
The third type of Chinese metaphor is personification (拟人 in Chinese), which treats objects as humans that can act like humans. Previous efforts on metaphor generation demonstrate that the task can benefit a wide range of NLG downstream tasks. Glucksberg (1989) suggested that verb metaphors are important to an engaging conversation. Zhou (2020) showed that machine-generated NMs are effective in stimulating user interest in communicating with chatbots. Chakrabarty et al. (2020, 2021) conducted human evaluations comparing literal expressions from machine-generated stories and poems with machine-generated metaphors, and found that users prefer the text with metaphors.
In this paper, we focus on generating nominal metaphors. The generation of NMs is defined as follows: given the subject of the metaphor, i.e., "Lilies" in the first example of Table 1, generate a comparison containing the comparator and the object of the comparison, i.e., "like" and "a bottle of perfume" in the example, respectively. There are two main challenges in Chinese metaphor generation. The first is the lack of annotated corpora: existing Chinese metaphor corpora are not large enough to power current data-driven text generation approaches. Second, auto-regressive language modelling is ineffective for learning metaphor generation: because a metaphor can be hidden in a very long sentence, the generative model tends to learn the entire sentence sequence rather than focusing on the metaphorical part of the input.
To address the aforementioned challenges, we propose a novel neural metaphor generation model that requires only limited labelled metaphor data for training. This is achieved by a multi-task framework which jointly performs novel self-training and metaphor weighting mechanisms. First, to tackle the scarcity of metaphor datasets, we employ self-training to leverage additional unlabelled data and improve metaphor generation performance. Self-training consists of three main steps: (1) train a teacher model on labelled training data; (2) detect potential metaphors in the unlabelled corpus; and (3) train a student model on the combination of the labelled data and the newly identified metaphors from the unlabelled data. Second, we employ metaphor identification to reveal the metaphor-related parts of the input. This permits our model to focus on the metaphor-related parts of the input sentence by assigning higher weights to metaphor-related content. Introducing metaphor identification not only improves the efficiency of model training, but also improves the metaphoricity of the generated metaphors.
As there are limited data available for nominal metaphor generation, we collect and annotate two corpora for model training: the Chinese Metaphor Corpus (CMC) and the Chinese Literature Corpus (CLC). CMC contains 2.7k metaphor examples and 3.5k literal examples, and can be used for both metaphor detection and generation. CLC is a large-scale unlabelled Chinese literature corpus, which our self-training algorithm leverages to identify additional high-quality labelled metaphors. We conduct both automatic and human evaluation of our model's performance in metaphor generation. Experimental results show that our model generates metaphors with better readability and creativity than the baseline models, even when training data is insufficient. Source code and data can be found at https://github.com/liyucheng09/Metaphor_Generator.

Related Work
Prior works on computational processing of metaphors can generally be classified into detection, interpretation and generation tasks.

Detection and Interpretation of Metaphors
Krishnakumaran and Zhu (2007) exploit the absence of a hyponymy relation between subject and object to identify metaphorical utterances. Shlomo and Last (2015) propose a random-forest classifier for NM identification using both conceptual features such as abstractness and semantic relatedness features such as domain corpus frequency. Su et al. (2016) follow the idea of hyponymy-relation absence from Krishnakumaran and Zhu (2007) and implement it using the cosine distance between pre-trained word2vec embeddings of the source and target concepts. Liu et al. (2018) and Zeng et al. (2020) tackle Chinese simile detection by designing a multi-task framework and a local attention mechanism, respectively. Here, a simile is a type of NM which uses a direct comparator such as "like" or "as". Su et al. (2016, 2017) focus on nominal and verb metaphor interpretation and perform experiments on English and Chinese metaphors. They extract properties of both the subject and the object of a metaphor from WordNet and use pre-trained word2vec embeddings to identify related properties shared by the compared pair of objects/concepts.

Generation of Metaphors
Despite the benefits that metaphor generation can bring to many NLG tasks, work on this task is still relatively sparse. Early works on metaphor generation often rely on templates. Terai and Nakagawa (2010) compute the relatedness between concepts with computational language analysis and select candidates to fill metaphor templates, e.g., "A is like B". Veale (2016)

Methodology
In this section, we provide the technical details of our proposed metaphor generation framework. The overall model architecture is shown in Figure 1, and includes: (1) the GPT2 model (Radford et al., 2019) and a text prediction layer; and (2) the metaphor identification module, which identifies potential metaphors in the unlabelled dataset (i.e., for self-training) and emphasises metaphorical words in the input sequence (i.e., for metaphor weighting).

Text Modeling and Metaphor Identification Module
The Pre-Trained Language Model We employ GPT2, a pre-trained unidirectional transformer language model, as our basic encoder. Given a sentence S = (w_0, ..., w_n), the GPT2 model produces a list of contextualized token embeddings (h_0, h_1, ..., h_n), where h_i is the representation of the i-th input token w_i. Since GPT2 is a unidirectional language model, each contextualized word embedding captures only the information of the preceding context; for example, h_i captures the information asserted before the i-th token.

The Text Prediction Layer To predict the output token, we apply a linear layer followed by a softmax function to the contextualized word embedding:
P(w_{i+1} | w_0, ..., w_i) = softmax(W h_i + b), where W and b are the trainable weight matrix and bias for text generation, respectively, and h_i is the contextualised word embedding of the i-th word.
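As a concrete illustration, the prediction layer can be sketched in a few lines of numpy. The hidden size and vocabulary size below are toy values, and the randomly initialised W and b stand in for the trained parameters described above; this is a shape-level sketch, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, V = 8, 50                       # hidden size and vocabulary size (toy values)
W = rng.normal(size=(V, d)) * 0.1  # trainable weight matrix
b = np.zeros(V)                    # trainable bias

h_i = rng.normal(size=d)           # contextualised embedding of the i-th token
p_next = softmax(W @ h_i + b)      # distribution over the next token
```

The output is a proper probability distribution over the vocabulary, from which the next token is chosen during generation.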

The Metaphor Identification Module
The metaphor identification module assigns a metaphorical probability to sentences (used in self-training) or sub-sentences (used in metaphor weighting). Specifically, we apply a linear layer plus a softmax layer to the contextualized embedding to compute the metaphorical probability. Formally, after obtaining the word representations (h_0, h_1, ..., h_n) from GPT2, we compute the metaphor probability as p_i = softmax(W_m h_i + b_m), where W_m and b_m are the trainable weight matrix and bias for metaphor identification, and p_i is the metaphor probability of the sub-sentence w_0, ..., w_i (or of the whole sentence if i = n). Because h_i captures the contextual information preceding w_i, p_i indicates whether the sub-sentence w_0, ..., w_i contains a metaphorical expression.
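The identification head can be sketched as follows: W_m projects each hidden state onto two logits (literal vs. metaphorical), and p[i] is the metaphorical-class probability for the prefix ending at token i. All shapes and values are toy placeholders, not the trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d, n = 8, 6                          # hidden size, sentence length (toy values)
H = rng.normal(size=(n, d))          # GPT2 hidden states h_0 ... h_{n-1}
W_m = rng.normal(size=(2, d)) * 0.1  # identification head: two classes
b_m = np.zeros(2)

# p[i] = probability that the prefix w_0 ... w_i is metaphorical
p = softmax(H @ W_m.T + b_m)[:, 1]
```

Because each h_i only sees the preceding context, the whole vector p of prefix probabilities is obtained in one forward pass, which is what makes the per-token weighting in Section Metaphor Weighting cheap.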

Self-Training
Self-training is an effective approach to making use of additional data to enhance deep learning models. It has shown significant gains in many deep learning tasks, such as machine translation (He et al., 2019), speech recognition (Parthasarathi and Strom, 2019), and image classification (Xie et al., 2020). As shown in He et al. (2019), self-training provides significant performance gains in sequence generation when the supervised corpus is relatively small. To reduce the noise introduced by pseudo-labelled data, each unlabelled instance is weighted by its metaphorical probability during metaphor modelling. We use the metaphor identification module to score each sentence with its probability of being metaphorical. Formally, given an unlabelled sentence x = (w_0, ..., w_n), the metaphor identification module computes its metaphorical probability as p_n = softmax(W_m h_n + b_m),
where h_n is the representation of w_n, the last token of x. Since h_n captures the context information of the entire sentence, we regard p_n as the metaphorical probability of x and use it to weight this training instance.
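The three self-training steps can be sketched as below. Here `train_fn` and the filtering threshold are placeholders for the actual GPT2 training procedure and selection criterion, which the paper embeds in the joint model rather than running as a separate loop; the sketch only shows the data flow.

```python
def self_train(labelled, unlabelled, train_fn, threshold=0.5):
    """Sketch of weighted self-training.

    labelled:  list of (sentence, label) pairs, label 1 = metaphorical.
    train_fn:  trains on (sentence, weight) pairs and returns a scoring
               function mapping a sentence to its metaphorical probability p_n.
    """
    metaphors = [(x, 1.0) for x, y in labelled if y == 1]
    teacher = train_fn(metaphors)                    # step 1: teacher model
    pseudo = [(x, teacher(x)) for x in unlabelled]   # step 2: score unlabelled data
    kept = [(x, p) for x, p in pseudo if p > threshold]
    return train_fn(metaphors + kept)                # step 3: student model
```

Each retained unlabelled sentence carries its probability p_n as a soft weight, so a confidently metaphorical sentence contributes more to the student's loss than a borderline one.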

Metaphor Weighting
We find that in metaphorical sentences, only part of the text is metaphor-related (i.e., the comparison between the objects/concepts pair and the comparator). To verify this, we analysed 200 metaphorical sentences randomly sampled from our dataset and found that only 27% of their words are metaphor-related. This indicates that it is necessary to encourage our model to focus on the metaphor-related parts of input metaphors; otherwise, training is less efficient and performance sub-optimal. To tackle this issue, we propose a novel approach based on metaphor detection to identify metaphorical words within sentences. As shown in Figure 2, this is achieved by calculating each word's contribution to the metaphor probability. Formally, we use I_i to indicate the importance of the i-th token in metaphor modelling: I_i = p_i - p_{i-1},
where p_{i-1} and p_i represent the metaphorical probabilities of the sub-sentences w_0, ..., w_{i-1} and w_0, ..., w_i, respectively. If I_i is positive, the i-th token makes the sub-sentence w_0, ..., w_i more metaphorical; if I_i is zero or negative, the token is irrelevant to the metaphorical expression. We weight each training step (i.e., given w_0, ..., w_{i-1}, predict w_i) with its corresponding I'_i, which is computed as shown in Equation (5).
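Given the prefix probabilities from the identification module, the token importances reduce to a first-order difference. The probability values below are made up for illustration, and treating I_0 as p_0 is our assumption for the sketch.

```python
import numpy as np

# Metaphorical probabilities of the prefixes w_0 ... w_i (made-up values);
# the jump at index 2 mimics a metaphorical word entering the prefix.
p = np.array([0.10, 0.12, 0.55, 0.60, 0.58])

# I_i = p_i - p_{i-1}: the i-th token's contribution to metaphoricity.
# (I_0 is taken to be p_0 here, an assumption for this sketch.)
I = np.diff(p, prepend=0.0)

# Tokens with a large positive contribution are treated as metaphor-related.
metaphor_related = np.where(I > 0.1)[0]
```

In this toy sequence only token 2 produces a sizeable jump in the prefix probability, so it is the one flagged as metaphor-related.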

Training and Inference
Training The training process of metaphor identification is as follows. Given a labelled corpus U = {(x_i, y_i)}, where each instance consists of a sentence x and a label y indicating whether x is a metaphorical sentence, we minimize the cross-entropy loss between the label y and the predicted probability p_n = softmax(W_m h_n + b_m), where h_n is the representation of the last token of x.
The training procedure of metaphor modelling is as follows. Given an unlabelled dataset C = {x_i}_{i=1}^{N}, where each instance is a sentence x = (w_0, ..., w_n), each instance is weighted by its metaphorical probability p_n and each token w_i is weighted by I'_i, its metaphor-importance weight. Formally, the loss of our metaphor modelling is the negative log-likelihood of each token, scaled by these two weights.

Inference At the inference stage, we use our model as a standard metaphorical language model to generate metaphors and do not perform metaphor identification. Given a target word w_t, i.e., the subject of the metaphor, we feed the target word concatenated with a delimiter as input to our model. The model then produces the next word recurrently until the ENDOFSENTENCE token is generated.
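The combined objective can be sketched as scaling each token's negative log-likelihood by its importance weight I'_i and the whole sentence by p_n. The exact form of the paper's loss may differ in normalisation, so treat this as an illustration.

```python
import numpy as np

def weighted_lm_loss(token_log_probs, token_weights, sentence_weight):
    """Negative log-likelihood with token-level (I'_i) and
    sentence-level (p_n) weights, as used in metaphor modelling."""
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    token_weights = np.asarray(token_weights, dtype=float)
    return float(-sentence_weight * np.sum(token_weights * token_log_probs))

# Toy example: two tokens with model probabilities 0.5 and 0.25,
# importance weights 1.0 and 2.0, and sentence weight p_n = 0.8.
loss = weighted_lm_loss(np.log([0.5, 0.25]), [1.0, 2.0], 0.8)
```

A sentence the identification module scores as non-metaphorical (p_n near 0) thus contributes almost nothing to the gradient, which is how the noise from pseudo-labelled data is kept in check.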

Experiments

CMC dataset
We recruited three native Chinese annotators to perform the metaphor annotation task. Before annotation, we provided annotation guidelines to the annotators and explained the relevant definitions with metaphor examples. The annotation task is to judge whether a sentence is a metaphor. Each sentence is labelled by three annotators and the final label takes the majority of the three labels. The annotation agreement on CMC is 0.84 in terms of Krippendorff's alpha (Krippendorff, 2011). The statistics of CMC are shown in Table 2, with some metaphor examples given in Table 3.

CLC dataset
In self-training, we need a large-scale corpus to enable the metaphor identification module to detect novel NMs. However, popular Chinese corpora, such as news, Wikipedia, and web pages, are not suitable as metaphor resources. Intuitively, literary text is a promising source of diverse metaphors. Therefore, we construct a Chinese literature corpus by collecting a large number of essays, novels, and works of fiction (see details in Appendix A). The statistics of CLC are shown in Table 2.

Baselines
Chinese metaphor generation is a novel task. We select three general generative models and an English simile generation method as baselines. RNN: an LSTM-based auto-regressive generative model consisting of three LSTM layers. SeqGAN: a sequence generative adversarial network (Yu et al., 2017). For decoding, all beam sizes are set to 12; thus each model generates 12 sentences for each target. In total, 2,400 sentences are obtained per model for testing.

Metrics
Automatic Metrics We use perplexity (PPL) to evaluate the fluency of the generated text, calculated with an open-source Chinese language model (Zhang et al., 2020). Dist-1 and Dist-2 (Li et al., 2016) compute the ratios of distinct unigrams and bigrams in the generated text, and measure a model's ability to produce diverse outputs. To test the metaphoricity (Meta) of the generated outputs, we train a RoBERTa-based Chinese metaphor classifier on CMC to compute the ratio of metaphorical utterances among the generated sentences. The accuracy of this classifier is 97.89%, which is robust enough for evaluation. We give details of the classifier in Appendix C.

Human Evaluation Due to the creative and delicate usage of metaphor, automatic metrics are not adequate for testing the quality of generated outputs. We therefore also perform human evaluation based on three criteria: 1) Fluency indicates how well the metaphor is formed, i.e., whether the expression is grammatical and fluent. 2) Consistency indicates whether the metaphor makes sense, i.e., how well the subject of the metaphor relates to the object. 3) Creativity scores how creative the annotators think the metaphor is. Note that the Creativity judgment is based on the annotators' real-life experience, rather than on whether the generated metaphor appears in the training dataset. Three annotators were instructed to rate each criterion from 1 to 5, where 1 denotes the worst and 5 the best.
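Dist-n is straightforward to compute over tokenised outputs (character- or word-level tokenisation both work for Chinese); a minimal sketch:

```python
def distinct_n(sentences, n):
    """Dist-n: ratio of unique n-grams to total n-grams over all outputs."""
    ngrams = []
    for tokens in sentences:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A model that repeats the same comparison across outputs collapses the set of n-grams and therefore scores low, which is exactly the failure mode the metric is meant to expose in creative generation.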

Automatic Evaluation
Results of automatic metrics are shown in Table 4.
Our method significantly outperforms the baselines on most automatic metrics. Our model obtains a lower PPL, which shows it is better at producing fluent and grammatical text. Higher Dist-1 and Dist-2 scores show that our method produces fewer repetitive unigrams and bigrams during generation, which is essential in creative language generation. The Meta (metaphoricity) score shows that our model produces more literal expressions than the baselines, which might result from the self-training procedure: non-metaphorical sentences are sometimes wrongly identified by the metaphor identification module, introducing some noise into the metaphor learning process.
We conducted an ablation study to test the effectiveness of self-training and metaphor weighting. The results show that the self-training mechanism improves both generation fluency and diversity; removing self-training from our model degrades all four automatic metrics by a large margin. The metaphor weighting mechanism mainly improves the metaphoricity of the generated metaphors and thus the Meta score.

Human Evaluation
We select 180 sentences in total for human evaluation, with the results shown in Table 4. We can see that our method beats the five baseline models on all three human-centric metrics. The most significant improvements lie in Consistency and Creativity, which shows that our method not only generates creative comparisons, but also provides a consistent context for each nominal metaphor, which is essential for readability and explainability. Human evaluation also demonstrates the effectiveness of self-training, which enhances generation quality in both the fluency and creativity aspects. The metaphor weighting mechanism shows less effect in human evaluation, as it mainly aims to improve the metaphoricity of the generated output, and the human metrics do not measure this aspect. We provide a visualisation of metaphor weighting in Appendix E.

Case Study
We show some generated examples from GPT2, SCOPE, and our model in Table 5, where the corresponding Consistency and Creativity scores are also provided. Specifically, the models generate metaphors by taking different target words as input. We see that although all three models are able to produce metaphorical outputs, the quality of the generated results differs among systems. First, in some cases, the baseline models fail to generate metaphorical outputs. For example, when fed "autumn" as input, both GPT2 and SCOPE fail to produce a metaphor and instead output a literal description. Second, the comparisons given by our model are more creative than those of the baselines, where GPT2 and SCOPE tend to generate common metaphor patterns. For example, GPT2 compares a mountain with "jade", a very common metaphor in Chinese. Finally, we find that our method generates metaphors with relatively more complicated structures and in a more poetic way. For example, our method does not employ a single word to construct a comparison; instead, it tends to generate detailed phrases such as "love is a gambling and I was the gambler" and "autumn wind is like a ribbon brush against you". These detailed components paint a more vivid picture, and thus improve the overall readability of the metaphors. The corresponding human-rated Consistency and Creativity scores also support this observation.

Conclusion

B SCOPE Model
The SCOPE model takes a literal expression as input and produces a corresponding simile. For example, given "the city is beautiful", SCOPE will transform the literal expression into a simile: "The city is like a painting".
In our experiments, to compare SCOPE with our method, we 1) feed a TENOR to the COMET (Bosselut et al., 2019) model to obtain properties of the TENOR; for example, given the query "<Autumn, SymbolOf>", COMET predicts a list of properties for Autumn: "passion", "gold", etc. We then 2) construct literal expressions using the TENOR and its properties; for example, "Autumn is a symbol of passion". 3) The literal expression is fed to the SCOPE model and a simile is produced; for example, "Autumn is like a lover". 4) Finally, the simile is concatenated with its literal expression to form a complete NM with context: "Autumn is a symbol of passion, like a lover".

C Meta Metric
The CMC corpus is split into a training set (80%) and a test set (20%) for training the classifier. We simply add a linear layer plus a binary softmax layer on top of the RoBERTa model as the NM classifier. The accuracy of the classifier on the CMC test set is 97.89%.

D More Examples
Table 7 shows generations produced by our method given different TENORs.

E Visualization of Metaphor Weighting
The visualization of the metaphor weighting mechanism is shown in Table 8. Example sentences from Table 8 include: "Once it comes out of the scabbard, it will cut off the bond of your life in an instant."; "秋天像个美人的画笔调侃着大地：世界上再没有比这更美的了。" (Autumn teases the earth like a beautiful brush: there is nothing more beautiful in the world.); and "爱心像一片照射在冬日的光，使饥寒交迫的人感到人间的温暖。" (Love is like a ray of sunshine in winter, which makes hungry and cold people feel the warmth of the world.)

Figure 1: The overall framework of our method. The metaphor identification module performs self-training and metaphorical word identification.
VUAMC is widely used in metaphor identification and served as a benchmark in the first and second ACL workshops on figurative language (Leong et al., 2018, 2020). Chakrabarty et al. (2020) crawl web content to construct the first English simile corpus for training their metaphor generator. Liu et al. (2018) release a small Chinese metaphor corpus with 120 examples including both verb and nominal metaphors. Zeng et al. (2020) publish a Chinese simile dataset focusing on a specific NM which uses the comparator "like".

Figure 2: Metaphorical word identification by computing each word's contribution to the metaphor probability.

Love is like a ray of golden light, which can illuminate your heart even at night. 爱像一盏明亮的夜灯，让迷途的航船找到港湾；(Love is like a bright night light, letting the lost ship find the harbor.) 时间像利剑一样无情的锋刃，一旦出鞘，瞬间就割断你人生的纽带。(Time is a ruthless blade like a sharp sword.)
Table 1: Examples of different types of Chinese metaphor: nominal metaphor, verb metaphor, and personification. Metaphorical words are in bold.
Classic self-training starts from a teacher model trained on a supervised dataset U. The teacher model is then applied to unlabelled data to obtain pseudo labels. Finally, a student model is trained on the synthetic dataset combining the pseudo-labelled data and the supervised data, S ∪ U. Instead of training an independent teacher model, we embed the teacher model in the overall model as an attention mechanism, as shown in Figure 1.

Algorithm 1: 1 Train a base model f on U = {x_i, y_i}; 2 repeat: 3 apply f to the unlabelled instances C; 4 select a subset S ⊂ {(x, f(x)) | x ∈ C}; 5 train a new model f′ on S ∪ U.

SeqGAN is trained with a generator implemented by an LSTM network and a discriminator implemented by a CNN; we train this model on CMC to produce Chinese metaphors. GPT2: the Chinese GPT2 model is fine-tuned on the CMC dataset to produce Chinese metaphors as a baseline model.

Table 3 :
Examples of metaphorical and non-metaphorical sentences in CMC.
BART: we extract parallel data (target word, metaphor) from CMC and use the paired data to fine-tune a Chinese version of the BART model (Shao et al., 2021). SCOPE (Chakrabarty et al., 2020): a SOTA method on English simile generation tasks, which fine-tunes the BART model on a large-scale automatically created literal-simile parallel corpus.

Table 4 :
Results of automatic metrics and human evaluation. Boldface denotes the best results among our method and the baselines. The inter-annotator agreement for human evaluation is shown in parentheses.

Table 4. The table also shows the inter-annotator agreement of the human evaluation via Krippendorff's alpha.

Table 5 :
Example metaphors generated by our method and the baselines. Con. and Cre. indicate the two human evaluation metrics Consistency and Creativity, respectively. We do not assign Con. and Cre. scores to non-metaphorical utterances. More examples from our method are shown in Appendix D.

Table 6 :
Summary of CLC.

Table 7 :
More generation examples of our method. (Example from Table 7: "Wandering will let him see things he has never seen before.")

Table 8 :
Sentences from the Chinese literature corpus with their metaphor probabilities; the visualization of the weights from the metaphor weighting mechanism is presented for each token.