I run as fast as a rabbit, can you? A Multilingual Simile Dialogue Dataset

A simile is a figure of speech that compares two different things (called the tenor and the vehicle) via shared properties. The tenor and the vehicle are usually connected with comparator words such as "like" or "as". Simile phenomena are unique and complex in real-life dialogue scenes, where the tenor and the vehicle can be verbal phrases or sentences, can be mentioned by different speakers, can exist in different sentences, or can occur in reversed order. However, current simile research usually focuses on similes in a triplet (tenor, property, vehicle) or a single sentence where the tenor and vehicle are usually entities or noun phrases, which cannot reflect complex simile phenomena in real scenarios. In this paper, we propose a novel, high-quality multilingual simile dialogue (MSD) dataset to facilitate the study of complex simile phenomena. MSD is the largest manually annotated simile dataset ($\sim$20K examples) and contains both English and Chinese data. The MSD data can also be used in dialogue tasks to test the ability of dialogue systems to handle similes. We design 3 simile tasks (recognition, interpretation, and generation) and 2 dialogue tasks (retrieval and generation) with MSD. For each task, we provide experimental results from strong pre-trained or state-of-the-art models. The experiments demonstrate the challenge of MSD, and we have released the data and code on GitHub.


Introduction
Simile plays an important role in human language to make utterances more vivid, interesting, and graspable (Zhang et al., 2021; He et al., 2022) and is an increasingly studied phenomenon in computational linguistics (Song et al., 2021; He et al., 2022). A simile is a figure of speech that compares two things from different categories (called the tenor and the vehicle) via shared properties (Paul, 1970). A tenor and a vehicle are usually connected with comparator words such as "like" or "as". For example, in the first example of Table 1, the tenor is "The boy", the vehicle is "a rabbit", the event is "run", the comparator is "as ... as", and the shared property is "fast".
Current simile research usually focuses on similes in a triplet (tenor, shared property, vehicle) (Song et al., 2021) or a single sentence (Bizzoni and Lappin, 2018; Liu et al., 2018; Li et al., 2022). For example, the simile recognition task (Birke and Sarkar, 2006; Liu et al., 2018) judges whether a sentence contains a simile, such as distinguishing which of the first and second examples in Table 1 contains a simile. However, a simile in a triplet or a single sentence is not enough to reflect the complex simile phenomena in real scenarios. In this paper, we study similes in real-life dialogue, where a tenor and a vehicle can be mentioned by different speakers, exist in different sentences, or occur in reversed order. The third example in Table 1 shows a simile dialogue where the tenor "That fireman" and the vehicle "a bull" are in different utterances. The fourth example in Table 1 shows a simile where the tenor and the vehicle are mentioned by different speakers and the vehicle occurs before the tenor. What is more, different from previous research where the tenor and vehicle are usually single entities (Song et al., 2021) or simple nominal phrases (Bizzoni and Lappin, 2018), the tenor and vehicle in a real-life dialogue may be a verbal phrase or a long sentence. A verbal phrase can function as the subject or object of a verb, as in the fifth example in Table 1. A sentence is a set of words expressing a statement, a question, or an order, usually containing a subject and a verb; the sixth example in Table 1 shows sentences as the tenor and vehicle. Verbal phrases and sentences can convey richer content and emotions, making real-life dialogue more vivid and interesting. Studying these complex simile phenomena in a dialogue scenario requires considering both the dialogue context and the various forms of the tenor and vehicle, and will lead simile research to a brand-new level. However, similes in real-life dialogue scenarios have not been studied by previous research, so there are no public benchmarks available.
To facilitate the simile study, we release a human-annotated, high-quality simile dialogue dataset, which contains both English and Chinese data. The complex simile phenomena in real-life dialogue scenarios not only bring more difficulties to traditional simile tasks such as recognition, interpretation (Su et al., 2016), and generation (Li et al., 2022) but also raise challenges for dialogue research, e.g., generation and retrieval tasks. To address the simile dialogue tasks, dialogue models need to understand the simile relations between entities/phrases/sentences. Our contributions are:

• To the best of our knowledge, we are the first to study the simile phenomenon in dialogue, and we propose a high-quality multilingual simile dialogue (MSD) dataset to assist both simile and dialogue research.
• We design 5 tasks with the proposed MSD dataset: for simile research, the dialogue simile recognition/interpretation/generation tasks; for dialogue research, the response retrieval and generation tasks.
• We verify how strong pre-trained models and state-of-the-art simile models perform on the 5 tasks we designed. Experimental results reveal that handling similes in dialogue is difficult and requires further study. Our code and data will be released on GitHub.

Table 2 examples:

Noun phrase: The nurse is an angel.
Adjective: These words are cold. / The soldier had a warm heart.
Verbal: The process was killed. / They plant the seeds of change.
Adverb-Verb: He speaks fluidly.
Verbal phrase: Taking care of pets is like raising children.
Sentence: I rushed to the terminal like a cheetah chasing its prey.

Related Work

Simile and Metaphor
Simile is a kind of metaphor that is frequently used in human languages to make utterances more vivid and graspable (Niculae and Danescu-Niculescu-Mizil, 2014) and to express human sentiments (Li et al., 2012; Mohammad et al., 2016). Previous researchers defined different metaphor categories; we present examples for these categories in the first four lines of Table 2. For example, Bizzoni and Lappin (2018) categorized metaphors into Noun phrase, Adjective, Verb, and Multi-word; Li et al. (2022) categorized metaphors into Nominal, Verbal (Subject-Verb-Object), Adjective-Noun, and Adverb-Verb. Previous work usually denoted the Noun phrase metaphor as a simile (Li et al., 2022; He et al., 2022; Chen et al., 2022). Following previous work, we also categorize the Noun phrase metaphor as a simile. Meanwhile, we extend the tenor and vehicle to verbal phrases and sentences according to the simile phenomena in dialogue.
The examples of verbal phrases and sentences in simile are shown in the last two lines of Table 2.
The interpretation task gives an interpretation to a metaphorical expression (Bizzoni and Lappin, 2018) or infers the shared properties of the tenor and the vehicle (Song et al., 2021; He et al., 2022; Chen et al., 2022). The generation task also has different forms. For example, given an input tenor, it can generate a simile sentence conditioned on that tenor (Li et al., 2022); given both the tenor and the shared property, it can generate the vehicle (Song et al., 2021; Chen et al., 2022); given a literal sentence, it can generate a metaphoric sentence that paraphrases the input (Chakrabarty et al., 2020; Stowe et al., 2021) or generate a specific simile according to the location where the simile interpolation should happen (Zhang et al., 2021).
In this paper, we also define recognition, interpretation, and generation tasks. However, different from previous work that only focused on similes in a triplet or a single sentence, we investigate a more challenging scenario where the simile happens in a multi-turn dialogue.

Survey of Simile Datasets
Table 3 shows the comparison between our MSD dataset and existing simile datasets.

Multilingual Simile Dialogue Dataset
In this section, we introduce the collection, annotation, and statistics of our MSD data.

Data Collection
Since we aim to extract similes in real-life dialogue, we adopt existing open-domain dialogue corpora collected from social platforms such as Reddit.com and Weibo.com. For English similes, we use the 3-turn version of the Reddit Dialogue dataset (Dziri et al., 2018), which contains more than 15 million dialogues. For Chinese similes, we use two datasets: PchatbotW and LCCC. PchatbotW (Qian et al., 2021) is the largest dialogue dataset we could find and contains 139 million 2-turn dialogues from Weibo. LCCC (Wang et al., 2020) is also from Weibo and contains 12 million 2- or 3-turn dialogues. We treat the last utterance in a dialogue as the response and the utterances before the response as the dialogue context. We extract dialogues from these large-scale datasets with a rigorous data collection pipeline, which is built on a set of rules we introduce in this section. Notice that we do not make any changes to the original dialogue data and only extract those dialogues with comparators in the response.
In the first step, we select the dialogue examples where the responses contain comparators such as "像...一样"/"like"/"as...as". We only select dialogue examples with context lengths between 15 and 30 words so that the dialogue context is both informative and not too long for the annotators to read. These examples are denoted as the coarse version of the simile dialogue data, and the statistics are shown in Table 4.
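The comparator filter in this first step can be sketched as follows. This is a minimal illustration, assuming a dialogue is a list of utterances with the response last; the regular expressions and the English-only word-count check are simplified stand-ins for the actual collection rules.

```python
import re

# Hypothetical comparator patterns: English "like" / "as ... as",
# Chinese "像...一样".
EN_COMPARATORS = re.compile(r"\blike\b|\bas\b.+\bas\b", re.IGNORECASE)
ZH_COMPARATOR = re.compile(r"像.*一样")

def is_candidate(dialogue, lang="en"):
    """Return True if the dialogue is a coarse simile candidate:
    comparator in the response, context length in [15, 30] words."""
    *context, response = dialogue
    pattern = EN_COMPARATORS if lang == "en" else ZH_COMPARATOR
    if not pattern.search(response):
        return False
    n_words = sum(len(u.split()) for u in context)  # English word count
    return 15 <= n_words <= 30

dialogue = [
    "Did you see him sprint to catch the bus this morning near the old station?",
    "Yes, I was really surprised he made it in time at all honestly.",
    "I run as fast as a rabbit, you know.",
]
```

For Chinese data the length check would count characters rather than whitespace-separated words; the sketch omits that detail.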
In the second step, we use machine translation to double-check that a sentence contains a comparator. We only keep the dialogue examples that still contain a comparator when translated into the other language. For example, the English simile candidate sentence "I run as fast as a rabbit" contains the comparator "as...as". When translated into Chinese, this sentence becomes "我跑得像兔子一样快" and still contains the comparator "像" (like). After the machine translation check, we obtain the fine version of the simile dialogue candidates. The fine version needs further improvement since the candidate tenor/vehicle connected by the comparator does not always form a simile. For example, the sentence "The Poodle is as tall as a Corgi" is not a simile since it compares the height of two different kinds of dogs. So we conduct a third step to further remove examples that are not similes.
In the third step, we adopt a semantic dependency tool to locate the candidate tenor/vehicle, then compute the similarity between them and retain the examples with low similarity so that the remaining candidate tenor/vehicle are from different categories. The similarity is computed with dense representations of the candidate tenor/vehicle from BERT (Devlin et al., 2019). After the above pipeline, we obtain the final version of simile dialogue data for annotation. The statistics of the fine/final versions are also shown in Table 4.
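The similarity filter in the third step can be sketched as below. In the paper the vectors are BERT representations of the candidate tenor/vehicle; here toy vectors and a hypothetical threshold of 0.5 stand in for them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def keep_pair(tenor_vec, vehicle_vec, threshold=0.5):
    """Retain the example only if tenor and vehicle are dissimilar,
    i.e. likely from different categories."""
    return cosine(tenor_vec, vehicle_vec) < threshold
```

Under this sketch, a pair like "I" / "a rabbit" (dissimilar vectors) would be kept, while "the Poodle" / "a Corgi" (similar vectors) would be filtered out.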

Data Annotation
We recruited 7 students majoring in English to annotate the English data and another 6 well-educated native speakers (graduate students) to annotate the Chinese data. We randomly selected 100 examples in the final version and found that the vehicle candidates we extracted have an acceptable accuracy (above 80%), while the accuracy of the tenor candidates is not good (below 60%). Hence, we provided annotators with the "dialogue context", "response", "comparator", and "vehicle candidate" for each dialogue. We use the annotation tool proposed by Yang et al. (2018) to simplify the operation so that the annotators can annotate with a mouse and a few keyboard shortcuts.
There are some difficulties when annotating similes in the dialogue scenario, apart from the fact that the tenor may exist in a different sentence or occur after the vehicle. For example, the tenor may not exist in the dialogue even if the response is a simile; we ask the annotators to delete these examples. In other cases, a dialogue contains commonly used phrases or slang that make it look like a simile when it is not. For instance, "make like a tree" is not a simile but slang meaning "leave". Besides, English words often have multiple meanings. For example, according to the Oxford Dictionary, the word "body" means "the whole physical structure of a human or an animal" as well as "a group of people who work or act together, often for an official purpose", so the sentence "This association is like the body that represents its members." is not a simile. Furthermore, there are many abbreviations used on social platforms, such as FTW (for the win) and OP (original poster). These difficult linguistic phenomena require the annotators to have a good understanding of the dialogue context so that they can determine whether a response contains a simile.
We conduct preliminary training for the recruited annotators so that they are aware of the professional standards.

Quality Evaluation. During the annotation, we each time send a small "*.txt" file containing hundreds of dialogue examples to the annotators and conduct a random sampling test after they return the annotated data. An annotator who returns a low-quality file is asked to recheck their annotations before we send the next file. The whole annotation takes 35 days, and each dialogue is annotated by 3 annotators. When determining the final result, the majority label is adopted if the three annotators disagree. The overall inter-rater agreement measured by Fleiss' kappa is 0.61, indicating substantial agreement among the annotators.
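The reported agreement statistic can be computed as in this sketch of Fleiss' kappa (the 0.61 figure is from our data; the ratings matrix below is illustrative only).

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an N x k matrix: ratings[i][j] is how many of the
    n raters assigned category j to item i (every row sums to n)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_total = n_items * n_raters
    # marginal proportion of each category
    p_j = [sum(row[j] for row in ratings) / n_total
           for j in range(len(ratings[0]))]
    # per-item observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators per dialogue, perfect agreement (e.g. every row like [3, 0]) yields kappa 1.0; agreement at chance level yields 0.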

Data Statistics
After the annotation, we obtain a total of 19,565 dialogues (8,146 English and 11,419 Chinese). The MSD has multiple comparators for both English and Chinese data. In the MSD English data, the "like" mode accounts for around 52.4% and the "as" mode for around 47.6%. In the MSD Chinese data, "像...一样" accounts for the most. The proportion of each comparator is similar in the simile and literal data. Table 5 shows some of the statistics of the MSD data. Please refer to the data link for more details.

(During annotation, we randomly selected 5% of the examples from each annotated file and checked whether the annotator made accurate annotations for these random examples. The annotators were preliminarily trained so that they were expected to make as few errors as possible. We expected no more than 1 error per 20 examples in the random sampling test; otherwise, the file was sent back for revision. In the few cases where the three annotators all disagreed with each other, we decided the label ourselves.)

Tasks and Results
In this section, we introduce the 5 tasks defined with our MSD dataset, including the task definition, baselines, evaluation metrics, experimental results, and analysis. Implementation details are shown in Appendix A.

Simile Recognition Task
Following previous work (Liu et al., 2018;Li et al., 2022), we define simile recognition as a binary classification task where the model needs to distinguish whether an input sequence contains a simile.The input is a multi-turn dialogue and the output is True (simile) or False (literal).

Baselines and Evaluation Metrics
We use two baselines: 1) BERT, which is widely used and proven effective in classification tasks. We randomly split our MSD-En/Ch data into train/validation/test (8:1:1) sets and use the train set to fine-tune BERT. We use the output vector of the first input token <cls> of BERT to calculate the classification score for the input dialogue (see Appendix A); 2) a large language model, ChatGLM.
The input to ChatGLM is a concatenation of three parts: the definition of a simile ("A simile is a figure of speech that compares two different things via their shared properties."); a requirement ("answer yes or no to this question: does the following dialogue contain a simile?"); and a simile dialogue example such as those in Table 1. We then calculate the results according to the predictions of the baselines. Following previous work (Liu et al., 2018), we use Precision/Recall/F1 to measure the results.

Results and Analysis
Table 6 shows the simile recognition results. We can see that BERT (fine-tuned) performs much better on Precision and F1 than ChatGLM on both MSD-En and MSD-Ch. This is reasonable since the BERT models are fine-tuned on our training set. On the other hand, ChatGLM is much better on Recall in a zero-shot setting. Overall, the classification results of both BERT and ChatGLM still have much room for improvement. Using syntactic structure information to locate simile components may help this task.

Simile Interpretation/Generation Tasks
Following the previous simile interpretation task (Song et al., 2021; He et al., 2022) and simile generation task (Song et al., 2021), we define Simile Interpretation/Generation (SI/SG) as multiple-choice tasks with the "as...as" mode in our MSD-En data (we test with 450 examples), since the shared property naturally exists in the comparator.
For the interpretation task, we take a simile dialogue where the shared property between the two "as"s is removed and replaced with a blank. The model needs to select a property from 4 choices (one correct answer and three distractors) for the blank. We construct the distractors with ConceptNet (Speer et al., 2017). In particular, we first use the tenor and some relations to find concepts related to the tenor and then use the HasProperty relation to find the distractors. Notice that for the examples where the tenor is a phrase or a sentence that we could not find in ConceptNet, we use keywords (e.g., the subject of the sentence, the noun in the phrase) as the tenor to search ConceptNet.
Similar to the simile interpretation task, we remove the vehicle in a simile dialogue and leave a blank for the simile generation task. The model needs to select a proper vehicle for this blank from 4 candidates (one correct answer and three distractors). We also construct the distractors with ConceptNet: we use the vehicle and certain relations in ConceptNet to find concepts related to the vehicle as the distractors. Notice that for the examples where the vehicle is a phrase or sentence that we could not find in ConceptNet, we use vehicles from other dialogues in the MSD dataset as the distractors.
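The distractor construction for the SG task might be sketched as follows. A real implementation would query the ConceptNet API; here `KNOWLEDGE` is a tiny hypothetical stand-in for related-concept lookups, and `fallback_pool` plays the role of vehicles drawn from other MSD dialogues.

```python
import random

# Hypothetical stand-in for ConceptNet RelatedTo lookups.
KNOWLEDGE = {
    "rabbit": ["hare", "turtle", "carrot"],
}

def build_choices(gold_vehicle, fallback_pool, n_distractors=3, seed=0):
    """Return the gold vehicle plus distractors taken from related concepts,
    falling back to vehicles from other dialogues when the lookup fails."""
    rng = random.Random(seed)
    distractors = KNOWLEDGE.get(gold_vehicle, [])[:n_distractors]
    if len(distractors) < n_distractors:
        extra = [v for v in fallback_pool
                 if v != gold_vehicle and v not in distractors]
        distractors = distractors + rng.sample(extra,
                                               n_distractors - len(distractors))
    choices = [gold_vehicle] + distractors
    rng.shuffle(choices)  # hide the gold answer's position
    return choices
```

The SI distractors would be built analogously, except the lookup goes tenor → related concept → HasProperty.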
To ensure the distractors are true negatives, we randomly select 50 dialogue examples and manually check the quality of the distractors. We find that 92% of the distractors are well selected and the remaining 8% are not as ideal as we expected but can still serve as distractors. More details about using ConceptNet are shown in Appendix C.

Baselines and Evaluation Metrics
The first baseline is a BERT-large model, which takes the whole dialogue with the shared property or the vehicle masked and predicts the masked words. The second baseline is BERT-Probe (He et al., 2022), which fine-tunes BERT on the simile interpretation task. To compare on both the SI and SG tasks with this baseline, we further fine-tune the BERT-Probe model on the SG task using the data proposed by He et al. (2022). The third baseline is BERT-ANT (Chen et al., 2022), which is trained with masked word prediction on metaphor data and can solve the simile interpretation and generation tasks in a unified framework of simile triple completion. For example, given tenor=fireman and vehicle=bull, BERT-ANT can generate a list of words including shared properties like "strong" or "brave". All baselines are based on a BERT-large-uncased model. Since there can be multiple masked words in our SI/SG experiments, we encode the predicted words and the candidates into dense vectors with a sentence-transformer (huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Then we compute the cosine similarity between the predicted words and each of the candidates; the candidate with the highest similarity is chosen as the answer. We use Hit@1 to measure the accuracy.
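The answer-selection step can be sketched as below: the model's predicted words and each candidate are embedded (with a sentence-transformer in the paper; the vectors here are toy stand-ins), and the candidate closest to the prediction by cosine similarity is chosen. Hit@1 is then the fraction of correct choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pick_candidate(pred_vec, candidate_vecs):
    """Index of the candidate most similar to the predicted words."""
    sims = [cosine(pred_vec, c) for c in candidate_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

def hit_at_1(examples):
    """examples: list of (pred_vec, candidate_vecs, gold_index) triples."""
    correct = sum(pick_candidate(p, c) == g for p, c, g in examples)
    return correct / len(examples)
```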

Results and Analysis
Table 7 shows the results of the simile interpretation/generation tasks. We can see that BERT-Probe performs better than BERT-large, showing that a model pre-trained on simile data can better align the simile components in an input sequence and predict the missing component, even though its training data is quite different from our proposed data. BERT-ANT performs similarly to the other two models on the SG task but not as well on SI. This is because the training data of BERT-ANT is more metaphor data than simile data, and a large portion of the metaphor data does not have shared properties. Hence, BERT-ANT is more powerful at connecting tenor and vehicle but less powerful at predicting shared properties. Overall, the results on both simile interpretation and generation still have much room for improvement. How to exploit the semantic information in context to help these tasks requires further study.

Response Retrieval Task
Following previous work in retrieval (Guo et al., 2016), we define response retrieval as a ranking task. The input is a multi-turn dialogue context and multiple response candidates (including the correct one), and the model needs to rank all the candidates so that the correct one has the highest score. In particular, for each dialogue context in the MSD simile data (both English and Chinese), we randomly select 19 responses from other dialogues as negative examples.

Baselines and Evaluation Metrics
We use BERT-base for our baseline in response retrieval since it is widely used and proven to be effective in retrieval tasks.We concatenate dialogue context and each of the response candidates as the input sequence to the pre-trained model.Then we use the output of the first input token <cls> to compute the score for the input sequence as in Appendix A. Finally, the response candidate with the highest score will be chosen as the answer.
We first randomly split the Reddit dialogue data into train/validation/test (14.99M/5K/5K) sets. Then we use the BERT model to train an English dialogue retrieval model with the train/validation data; the trained model is denoted by BERT(Reddit). We choose the checkpoint with the best performance on the validation set and compare its performance on the Reddit Test set and the MSD-En set. Similarly, we combine LCCC and PchatbotW and randomly select 12M/5K/5K examples from the combined data as train/validation/test sets to train a Chinese dialogue retrieval model. The trained BERT model (https://huggingface.co/bert-base-chinese) is denoted by BERT(Ch) and used to compare performance on the LCCC+PchatbotW Test set and the MSD-Ch set. We measure retrieval accuracy with Recall@1/2/5.
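The Recall@k metric used here can be sketched as follows, assuming each example provides the model's scores for the 20 candidates and the index of the gold response.

```python
def recall_at_k(score_lists, gold_indices, k):
    """Fraction of examples whose gold response is ranked in the top k.
    score_lists[i] holds the model scores for example i's candidates,
    gold_indices[i] is the position of the correct response."""
    hits = 0
    for scores, gold in zip(score_lists, gold_indices):
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        hits += gold in ranked[:k]
    return hits / len(score_lists)
```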

Results and Analysis
Table 8 shows the results of the response retrieval task. The performance of BERT(Reddit) and BERT(Ch) on MSD is lower than their performance on the Reddit and LCCC+PchatbotW Test sets, respectively. The results show that the data distribution in MSD is different from that of the data used to extract it, and that selecting a simile response is much harder than selecting an ordinary proper response. The low Recall results show that the dialogue retrieval task on MSD simile data needs further study. It requires a model that judges not only the relevance between context and response but also the plausibility of the simile.

Response Generation Task
The traditional response generation task uses the dialogue context as input and outputs the response. In this section, we also introduce a new generation task that completes the response sentence after the comparator. Taking the fifth simile dialogue "Arguing with parents is not wise. It is like throwing an egg at a rock." as an example, we give the model "Arguing with parents is not wise. It is like" as input and ask the model to generate the rest: "throwing an egg at a rock.". This is different from the Writing Polishment with Similes task (Zhang et al., 2021) since our task is in a dialogue scene: the model needs to understand the difference between speakers and complete the simile sentence. We use the simile data in MSD for the generation experiments. We conduct comparative experiments on the Reddit-dialogue Test set and the LCCC+PchatbotW Test set used in the response retrieval task to show the difference between datasets.

Baselines and Evaluation Metrics
For the traditional response generation task, we use DialoGPT (Zhang et al., 2020) and GODEL (Peng et al., 2022) for the English data, and T5-base, BART-large (Lewis et al., 2020), GPT-2 (Radford et al., 2019), and CDialGPT (Wang et al., 2020) for the Chinese data. We choose these baselines since 1) they are widely used and proven effective in dialogue generation tasks; for example, GODEL (Grounded Open Dialogue Language Model) is pre-trained for dialogue and initialized from T5 (Raffel et al., 2020), while CDialGPT and BART-large are pre-trained with LCCC-large; and 2) the different model sizes provide more insight into the experiments. For our proposed response generation (completion) task, we conduct the experiment on English data with DialoGPT. We use the following automatic evaluation metrics employed by dialogue research: Perplexity (PPL), BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Lavie and Agarwal, 2007), and Distinct (Li et al., 2016). PPL measures the probability of the model predicting the real response. BLEU measures the n-gram overlap between the generated response and the reference one. ROUGE is based on the recall of common sub-sequences between the generated response and the real one. METEOR further considers the alignment between the generated and real responses to improve on BLEU. Distinct measures the diversity of responses by calculating the proportion of distinct n-grams among all n-grams. Higher BLEU/ROUGE/METEOR/Distinct means better performance. PPL is only comparable between models with the same vocabulary; the results are provided for future research.
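As a concrete example of one of these metrics, Distinct-n can be computed as below (whitespace tokenization is a simplification).

```python
def distinct_n(responses, n):
    """Distinct-n: number of unique n-grams across all generated responses
    divided by the total number of n-grams."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A model that keeps repeating the same phrases scores low; fully distinct outputs score 1.0.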

Results and Analysis
Table 9 shows the generation and completion results. On most metrics of the English data, DialoGPT and GODEL perform better on MSD-En than on Reddit-dialogue. CDialGPT and GPT-2 have comparable performance on the LCCC+PchatbotW Test set and MSD-Ch. This is different from the response retrieval task, where the MSD data is more difficult than the original data used to extract it. The reason may be that the dialogue context in MSD provides more information than the context in the original data, so the generation models can leverage the rich context to construct an informative response. The experiments also verify that larger models (GODEL/T5/BART) perform better. However, even the best baseline can still be improved. Analyzing the generation results, we find that although there are some interesting cases, most of the results are not similes. This means the simile dialogue generation task requires a specific model design to capture the simile relations in context. We provide a case study in Appendix D.
For the response completion task, when given the comparator, DialoGPT shows a large performance gain. This proves that simile generation can benefit from such guidance. Please refer to our code/data link for more experimental results on this simile dialogue completion task.

Conclusion
We propose a manually annotated multilingual simile dialogue (MSD) dataset for both simile and dialogue research. We design 3 simile tasks (recognition, interpretation, and generation) and 2 dialogue tasks (retrieval and generation) with MSD. Experiments with strong baselines show the challenge of each task. Future work includes but is not limited to 1) dataset enlargement (e.g., more annotated examples with more kinds of comparators); 2) model design (e.g., models with a specific structure to address the proposed tasks); and 3) new task design (e.g., detecting tenors in the coarse/fine data). We encourage using MSD in future simile and dialogue research.

Limitations
Due to time constraints, we were unable to implement some unreleased models as baselines for the proposed tasks. We did not conduct simile interpretation/generation on MSD-Ch in this paper since we could not automatically annotate the shared property in the Chinese data as with the "as...as" mode in English. We are currently working on this annotation and plan to release Chinese simile interpretation/generation results at the data link. The coarse/fine version data introduced in this paper can still be used to enlarge the MSD data; we will study how to utilize them for more simile data and richer language phenomena.

Ethics Statement
We provide and emphasize some details of our work to address potential ethical concerns. First, all the data sources used in the data collection process are publicly available. We did not make any changes to the data sources and only extracted dialogue examples from them. We carried out strict quality control during the extraction and annotation process and made sure that there are no sensitive words, even though the original data sources had already conducted this kind of checking. However, using our data to train or fine-tune a pre-trained generation model may still produce semantic errors or unpleasant similes or responses. One reason is that simile is a difficult task that compares two different things; mistakes can happen even when humans use similes. The other reason is that the knowledge stored in the original parameters of the pre-trained models may dominate the generation. We protect the privacy rights of the annotators and paid 0.55 Chinese Yuan for each annotated dialogue. The income of each annotator was above 100 Chinese Yuan per hour (on January 20, 2023, 100 Yuan could be converted into 14.73 US dollars).

A Implementation Appendix
The implementations of the pre-trained models in this paper are all based on the public PyTorch implementations, and the hyper-parameters follow the default settings. We did not truncate any of the dialogues because the dialogue length in MSD is much smaller than the maximum input length of the pre-trained models. We use a single Tesla V100S GPU with 32GB memory to conduct the experiments, and the batch size is 8 for all experiments. Checkpoints are chosen with the best performance on the corresponding validation set. In the simile recognition and dialogue retrieval tasks, the first input position of the model is a special token "<cls>", and the corresponding output vector $E_{cls}$ is fed into a nonlinear layer to compute the final score of the input sequence:

$$s = \sigma\big(W_2\,\mu(W_1 E_{cls} + b_1) + b_2\big),$$

where $W_{1,2}$ and $b_{1,2}$ are training parameters and $\sigma$/$\mu$ are the sigmoid/tanh functions, respectively. When training the simile recognition model, the loss is the cross-entropy between the predicted labels $y_i$ and the ground-truth labels $\bar{y}_i$:

$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\big[\bar{y}_i \log y_i + (1-\bar{y}_i)\log(1-y_i)\big],$$

where $N$ is the number of simile examples. When training the dialogue retrieval model, the loss is calculated as follows:

$$\mathcal{L}_{ret} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\log s(C_i, R_i^{+}) + \sum_{j=1}^{\alpha}\log\big(1 - s(C_i, R_{i,j}^{-})\big)\Big],$$

where $C$ is the dialogue context, $R$ is the response ($R^{+}$ the positive, $R^{-}$ a negative), and $\alpha$ is a hyper-parameter giving the number of negative samples per positive one. We set $\alpha = 9$ in our training.
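A stdlib-only sketch of the scoring head described above: the <cls> vector passes through a tanh layer and then a sigmoid to yield a scalar score in (0, 1). The dimensions and weights below are illustrative only.

```python
import math

def tanh_layer(W, b, x):
    """h = tanh(W x + b), with W given as a list of rows (mu = tanh)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def score(e_cls, W1, b1, w2, b2):
    """Scalar score sigma(w2 . tanh(W1 e_cls + b1) + b2)."""
    h = tanh_layer(W1, b1, e_cls)
    z = sum(w * hi for w, hi in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))  # sigma = sigmoid
```

In practice `e_cls` would be the 768- or 1024-dimensional BERT output and the layers would be trained `torch.nn.Linear` modules; the pure-Python version just makes the arithmetic explicit.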
In Table 12, we provide two cases to show dialogue examples in MSD and the generation results from different models. In the first (English) example, both DialoGPT and GODEL generate fluent responses containing the comparator "like" or "as"; however, both models fail to generate a simile response like the ground-truth one. The Chinese example is extracted from the LCCC data; we can see that BART-large performs best and gives an informative response with a simile in it. GPT-2 gives a generic response and T5-base gives an informative one. CDialGPT also gives a generic response even though it is trained on the LCCC dataset. The two cases in Table 12 further verify that simile dialogue generation is challenging. However, in the response completion task, when the comparator is added to the input, DialoGPT outputs a simile and makes the dialogue more vivid and interesting.

Figure 1: The data collection and annotation process.

Table 1: Examples to illustrate similes. The underlined font represents tenors; the italic font represents vehicles. A and B are different speakers.

Table 2: Different metaphor categories. The underlined font represents tenors; the italic font represents vehicles. The similes in our MSD data cover the Noun phrase, Verbal phrase, and Sentence categories. The two examples under Adjective show two different Adjective-Noun modes. The two examples under Verbal are the Subject-Verb and Subject-Verb-Object modes.

Table 4: Statistics of the dialogue datasets we collected.

Table 5: The statistics of the MSD dataset. "diff." means "different"; "Ave." is short for "Average".

We ask the annotators to first check whether the response in a dialogue example contains a simile. The example is annotated "Literal" if the response is not a simile. Otherwise, they check whether the vehicle candidate in the response is correct. They need to annotate the correct vehicle (which can be a word/phrase/sentence) if the candidate is not accurate. If the candidate vehicle is correct, they then annotate the tenor (which can be a word/phrase/sentence) if it exists. We present the annotation schedule in Figure 1. Our annotation schedule ensures that the tenor and vehicle are in the data.

Table 9: Dialogue generation and completion results.