StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding

Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, \textsc{StoryAnalogy}, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on \textsc{StoryAnalogy}, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in \textsc{StoryAnalogy} can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.


Introduction
Analogy-making plays a central role in human reasoning abilities. By drawing similarities between seemingly unrelated concepts (e.g., in Figure 1, "virus" vs. "burglar") and processes ("the virus invades cells" vs. "the burglar breaks into the house"), we can infer that the virus infiltrates and damages cells in a similar way to how a burglar breaks into a house to steal or cause harm. These story-level analogies, which involve comparing entire narratives or coherent sequences of events, enable intelligent agents to gain insights (Boden, 2009; Ding et al., 2023; Bhavya et al., 2023) and understand complex phenomena (Webb et al., 2022).
Despite its significance, there has been limited research on story analogies. One of the reasons is the lack of available data and evaluation benchmarks. In contrast, the community has predominantly focused on word-level analogies, which involve identifying relational similarities between pairs of concepts (e.g., king to man is like queen to woman) (Mikolov et al., 2013; Gladkova et al., 2016; Czinczoll et al., 2022).
In this work, we introduce STORYANALOGY, a large-scale story-level analogy corpus derived from various domains (scientific scripts, social narratives, word analogies, and knowledge graph triples) to facilitate the study of complex analogies. The story-level analogies we examine contain richer relational details, such as relations between entities (e.g., virus, invades, cells) and between events (e.g., the virus invades cells; as a result, the virus damages DNAs).
One of the challenges in building STORYANALOGY is establishing a clear and specific way to evaluate story analogies. To address this problem, we extend the Structure-Mapping Theory (SMT; Gentner, 1983) to evaluate longer texts. According to SMT, analogies hold (e.g., the hydrogen atom vs. the Solar System) because of the similarity in relational information (e.g., the relative motion between objects), rather than attributive information (e.g., size), between the source and target. Conversely, if both types of information are similar, the source and target exhibit a literal similarity (e.g., the X12 star system vs. the Solar System). Inspired by this notion, we extend SMT to the story level (§ 2.1). We use entity and relation similarity to assess the level of similarity in attributes and relations between the source and target stories. Additionally, we propose an analogy score based on these two similarities to quantify the degree of analogy between stories. Figure 2 provides a visual representation of the similarity space spanned by the two similarities.

[Figure 2 here: the similarity space spanned by entity/topic similarity and relation similarity, with example story pairs illustrating the Analogy, Literal similarity, Mere-appearance, and Anomaly (dissimilarity) regions.]
We then collect candidate story analogies for similarity annotation. Since story analogies are scarce in free texts, we use large language models (LLMs) to generate story pairs that are likely to be analogies. The stories are sourced from various domains, including scientific scripts (Dalvi et al., 2018), social commonsense stories (Mostafazadeh et al., 2016), word-level analogies (Turney et al., 2003; Czinczoll et al., 2022), and knowledge graphs (Speer et al., 2017). Next, we conduct crowd-sourcing to obtain similarity annotations for each candidate story pair. As a result, we create STORYANALOGY, which consists of 24K diverse story pairs, each with human annotation guided by the extended SMT.
Based on STORYANALOGY, we curate a set of tests to evaluate the analogy identification ability of models. Our findings indicate that both competitive encoder models (such as SimCSE (Gao et al., 2021) and OpenAI's text-embedding-002) and LLMs (such as ChatGPT (OpenAI, 2022) and LLaMa (Touvron et al., 2023)) have a significant gap compared to human performance in terms of predicting the level of analogy between stories. We further evaluate LLMs using multiple-choice questions derived from the story candidates. Even the best-performing LLM still falls short of human performance by 37.7%. Furthermore, we discover that using stories in STORYANALOGY can enhance models' ability to identify and generate analogies. By employing few-shot in-context learning and fine-tuning on STORYANALOGY, baseline models achieve a considerable performance boost. For instance, a fine-tuned FlanT5-xxl model exhibits generation quality on par with zero-shot ChatGPT. We hope that the data and evaluation settings we propose in this study will benefit the research community in the area of story analogies.

STORYANALOGY
Conventional benchmarks in computational analogy primarily focus on word-level analogies (e.g., word to language is like note to music). However, less attention has been given to more sophisticated analogies. We introduce STORYANALOGY, a dataset of 24,388 pairs of stories (e.g., "The virus invades cells and DNAs are damaged." versus "A burglar breaks into the house and smashes the valuables inside."), each annotated with two dimensions of similarity based on SMT.

Evaluating story analogies
To assess the degree of analogy between a pair of instances, recent studies classify story pairs using a set of labels. For instance, Sultan and Shahaf (2023) use 5 labels: not-analogy, self-analogy, close-analogy, far-analogy, and sub-analogy. Nagarajah et al. (2022) use 6 labels: shallow attribute analogy, deep attribute analogy, relational analogy, event analogy, structural analogy, and moral/purpose. However, they observed very poor agreement among annotators for most labels, which indicates a vague understanding of the task. Making comparisons across these studies is challenging due to the vastly different settings.
In cognitive psychology, the Structure Mapping Theory (SMT; Gentner, 1983) is well known for its explanation of the cognitive process of making analogies between objects. SMT evaluates object comparisons from two perspectives: (a) the attributes of objects and (b) the relational structures between objects. Analogies between objects occur when they have similar relational structures but dissimilar attributes (e.g., the hydrogen atom vs. the Solar System). In contrast, literal similarity occurs when objects have both similar relational structures and attributes (e.g., the X12 star system vs. the Solar System).
Based on SMT, we propose to compare stories by their entity and relation similarity. These measures assess the degree of similarity in terms of attributive and relational structures, respectively. We provide the necessary extensions to their definitions:

Entity similarity (EntSim). The similarity of entities and topics discussed between a pair of stories, ranging from 0 (unrelated) to 3 (almost equivalent). This score should be high if the two stories are both discussing apples and pears, even if they differ greatly in the details.

Relation similarity (RelSim). The similarity of relational structures between a pair of stories, ranging from 0 (very poor alignment) to 3 (alignment). In this context, the relational structures refer to the connections between elements at different levels. For instance, first-order relations can be regarded as the relationships between entities, such as predicates. Second-order relations, on the other hand, represent connections between higher-granularity elements, such as the logical connection between events or sentences. We encourage annotators to also consider higher-order relational similarity, such as the moral or purpose behind the stories.
We present the established similarity space with example source and target stories in Figure 2.
Modeling the analogy score (α). We discuss possible definitions of the analogy score (α). The score α should be proportional to the level of analogy between a pair of stories. Defining α to be equivalent to RelSim has been adopted in word analogy (Ushio et al., 2021a). However, this definition cannot distinguish analogy from literal similarity, as both of them have high RelSim (Figure 2). We can alleviate this problem by introducing EntSim into the definition of α: according to SMT, analogy happens when the RelSim between the source and target story is high and the EntSim is low. Therefore, in the rest of this paper, we define α as RelSim/EntSim.
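As a small illustration (not the dataset's scoring code), the sketch below computes α from the two annotated similarities; the smoothing constant added to the denominator is an assumption made here to avoid division by zero and is not part of the paper's definition.

```python
# Minimal sketch of the analogy score alpha = RelSim / EntSim (Section 2.1).
# The epsilon smoothing is an assumption made in this sketch to avoid
# dividing by zero; the paper's exact handling may differ.
def analogy_score(rel_sim: float, ent_sim: float, eps: float = 0.1) -> float:
    return rel_sim / (ent_sim + eps)

# An analogy has high RelSim but low EntSim ...
print(analogy_score(rel_sim=2.8, ent_sim=0.5))   # high alpha -> analogy
# ... while literal similarity has both similarities high.
print(analogy_score(rel_sim=2.8, ent_sim=2.7))   # low alpha  -> literal similarity
```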

Distilling story analogies from LLMs
Obtaining a large number of story analogies by retrieval is difficult. Evidence from Sultan and Shahaf (2023) shows that the prevalence of analogies within a categorized dataset is around 3%, and the ratio is expected to be much lower in general corpora. Identifying analogies by retrieving from general corpora would thus require enormous human effort, making it unrealistic to build a large-scale story analogy collection in this way. Recent observations suggest that LLMs are capable of understanding and predicting analogies for problem solving (Webb et al., 2022) and cross-domain creativity (Ding et al., 2023), and of generating explanations for word analogies (Bhavya et al., 2022). In addition to these findings, we discover that LLMs can generate high-quality story analogies (i.e., with more than half of the generations being analogies). Here, we introduce the pipeline for generating story analogies.

Generating from word pairs. Given a word analogy pair (e.g., "word", "language" and "note", "music"), together with source-target analogies with the corresponding entities from seed examples, an LLM is prompted to generate both the source and target stories.
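A minimal sketch of this generation step is shown below, assuming an OpenAI-style completion API; the prompt wording and demonstration format are simplified stand-ins for the templates in Appendix A.1, and the model name is a currently available substitute for the text-davinci-003 model used to build the dataset.

```python
# Illustrative sketch of generating a candidate story analogy from a word
# analogy pair with an LLM. Prompt wording and the demonstration fields are
# simplified stand-ins, not the paper's exact templates (see Appendix A.1).
from openai import OpenAI

client = OpenAI()

def generate_candidate(source_words, target_words, demonstrations):
    prompt = "Write a pair of short analogous stories for the given word pairs.\n\n"
    for demo in demonstrations:  # seed examples, e.g. drawn from Table 8
        prompt += (f"Words: {demo['source_words']} :: {demo['target_words']}\n"
                   f"Story 1: {demo['source_story']}\n"
                   f"Story 2: {demo['target_story']}\n\n")
    prompt += f"Words: {source_words} :: {target_words}\nStory 1:"
    # The paper prompted text-davinci-003; that model is deprecated, so a
    # comparable completion model is used here for illustration.
    response = client.completions.create(model="gpt-3.5-turbo-instruct",
                                         prompt=prompt, max_tokens=128)
    return response.choices[0].text
```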

Annotation
To evaluate each candidate story pair under the extended SMT, we conduct crowd annotations on Amazon Mechanical Turk. We recruit crowd workers to annotate the entity and relation similarities for the collected pairs. In addition, workers are required to label an instance as "poor quality" if they find the generated content broken or toxic. The annotation consists of the following two rounds: (i) Qualification round. We first annotate 80 candidate story pairs (20 from each domain) to curate a qualification set. Three domain experts from our team are asked to read through the annotation instructions and independently annotate EntSim and RelSim for these pairs. The Spearman's ρ between each annotator's predictions and the average scores of the others ranges from 93% to 96% on EntSim, and from 89% to 95% on RelSim.
We invite crowd workers who have ≥90% historical approval rates and ≥1K approved HITs to take the qualification. Workers whose predictions achieve ≥70% Spearman's ρ with the average scores from the three experts pass the qualification. As a result, 158 and 80 workers passed the qualification for EntSim and RelSim, respectively. (ii) Main round. Qualified crowd workers are invited to the main round of annotation. We assign 5 different annotators to give predictions for each similarity of a story pair. To guarantee annotation quality, we follow the annotation setting in Agirre et al. (2012). We split the main round into multiple mini-rounds, each with 1K-2K candidate pairs. After each mini-round, we filter out and disqualify workers who do not show significant correlations with the average scores of the others. Workers are paid more than what is required by the local wage law. In addition, experts from our team manually check the quality of annotations and write feedback to workers accordingly.
The generated contents sometimes contain hallucinations or toxic content. We filter out story pairs labeled as "poor quality" by more than 10% of their annotators, which accounts for 142 instances. For each remaining story pair, we adopt the average scores from workers as its EntSim and RelSim.
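As a rough sketch of this post-processing (with illustrative field names, not the dataset's actual schema), the filtering and aggregation can be written as:

```python
# Sketch of the post-processing described above: drop pairs flagged as
# "poor quality" by more than 10% of their annotators, then average the
# remaining workers' scores. Field names are illustrative.
from collections import defaultdict
from statistics import mean

def aggregate(judgments):
    """judgments: list of dicts with keys pair_id, ent_sim, rel_sim, poor_quality."""
    by_pair = defaultdict(list)
    for j in judgments:
        by_pair[j["pair_id"]].append(j)
    aggregated = {}
    for pair_id, anns in by_pair.items():
        flagged_ratio = sum(a["poor_quality"] for a in anns) / len(anns)
        if flagged_ratio > 0.10:     # excluded from the corpus
            continue
        aggregated[pair_id] = {
            "EntSim": mean(a["ent_sim"] for a in anns),
            "RelSim": mean(a["rel_sim"] for a in anns),
        }
    return aggregated
```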

Analysis of STORYANALOGY
To assess inter-annotator agreement, we randomly sampled 1K instances with 3 independent annotations from our dataset. The Fleiss's kappa (Fleiss, 1971) on the binarized annotations is 47% for EntSim and 42% for RelSim, indicating moderate agreement among annotators. We additionally obtained expert annotations on 200 randomly sampled instances. The average Spearman's correlation between crowd and expert annotations is 64.7% on EntSim and 69.9% on RelSim.
The final dataset consists of 24,388 story pairs from four domains: ProPara (6.9K), ROCStories (4.9K), Word-Analogy (7.5K), and ConceptNet (5.0K). Stories in STORYANALOGY have 19.94 tokens on average. The distributions of EntSim and RelSim are presented in Figure 3. We randomly select 500 instances from each domain as the test set, and another 500 instances from each domain as the validation set. Examples of STORYANALOGY are shown in Table 1.

Story Analogy Identification
We begin by assessing the ability of models to identify story analogies using two different setups. The first evaluation setup is similar to Semantic Textual Similarity (STS) tasks (Agirre et al., 2012), where we calculate the Spearman's correlation between models' predicted similarity and the analogy scores (α) derived from annotations (§ 3.1). For the second evaluation, we reframe our dataset as multiple-choice questions and evaluate LLMs on this set (§ 3.2).

Correlation with the analogy score α
Similar to the STS-style evaluation (Agirre et al., 2012), we assess whether models can predict analogy scores based on embeddings (for encoder models) or by generation (for LLMs). We use a model to predict the similarity f(·, ·) for two stories. For encoder models, f(s1, s2) = Cosine(Encoder(s1), Encoder(s2)).
For LLMs, we prompt them to predict the EntSim and RelSim for the two stories. Finally, Spearman's correlations between the predicted similarities and the respective annotated scores are reported.
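As a concrete (hypothetical) illustration of this STS-style protocol for encoder models, the sketch below embeds both stories, scores them by cosine similarity, and reports Spearman's ρ against the annotated analogy scores; the encoder name is a placeholder rather than one of the paper's baselines.

```python
# Sketch of the STS-style evaluation: cosine similarity between story
# embeddings, correlated with the annotated analogy score alpha.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def sts_style_eval(story_pairs, alpha_scores, model_name="all-MiniLM-L6-v2"):
    # model_name is a placeholder encoder, not one of the paper's baselines.
    model = SentenceTransformer(model_name)
    s1 = model.encode([p[0] for p in story_pairs])
    s2 = model.encode([p[1] for p in story_pairs])
    cos = np.sum(s1 * s2, axis=1) / (
        np.linalg.norm(s1, axis=1) * np.linalg.norm(s2, axis=1))
    rho, _ = spearmanr(cos, alpha_scores)
    return rho
```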
Setups. We consider both encoder models and LLMs as baselines. Details are in § A.2.
The encoder models we evaluate include RoBERTa (Liu et al., 2019), SimCSE (Gao et al., 2021), OpenAI-ada (text-embedding-ada-002), Discourse Marker Representation (DMR) (Ru et al., 2023), RelBERT (Ushio et al., 2021b), and GloVe embeddings (Pennington et al., 2014) over nouns, verbs, or all words. In addition to the unsupervised encoder models, we also fine-tune two models on the training set: a regression model, RoBERTa-Reg, which has a multilayer perceptron on top of the RoBERTa model that predicts EntSim and RelSim, and a contrastive learning-based model, RoBERTa-CL, which uses a contrastive learning objective to optimize its representations.
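The following is a minimal sketch of a RoBERTa-Reg-style architecture, assuming a single-sequence encoding of the story pair and a two-unit regression head; the pooling strategy, head size, and training details are assumptions here, not the paper's exact configuration.

```python
# Minimal sketch of a RoBERTa-Reg-style model: a RoBERTa encoder with an MLP
# head predicting EntSim and RelSim for a story pair. Pooling and head
# dimensions are assumptions, not the paper's exact configuration.
import torch.nn as nn
from transformers import AutoModel

class RobertaReg(nn.Module):
    def __init__(self, name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 2))  # -> (EntSim, RelSim)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS]-style pooling (assumption)
        return self.head(cls)

# A story pair can be fed as one sequence, e.g. "story1 </s></s> story2",
# and trained with an MSE loss against the annotated scores.
```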
For LLMs, we test FlanT5 (Chung et al., 2022), LLaMa (Touvron et al., 2023), ChatGPT (OpenAI, 2022), and GPT-3.5 (text-davinci-003). Each model input is composed of three parts: the instructions, which explain the similarity scores; N examples; and the query story pair. We evaluate models with two instructions (short and long, where the short instructions only contain the labels, and the long instructions additionally include label definitions), and N is set to 0, 1, or 3.
Results. The overall evaluation results are presented in Table 2. Generally, the models perform relatively poorly on the analogy score α, indicating that there is still room for improvement on STORYANALOGY. We have the following observations: (1) Similarities from state-of-the-art sentence embedding models are not good indicators of story analogy. Encoders such as RoBERTa, SimCSE, and OpenAI-ada show relatively good correlation with EntSim and RelSim, but they perform poorly on the analogy score α. This suggests that their embeddings are suitable for literal similarity retrieval but not analogy retrieval. (2) Relational feature-aware models are better at analogy identification. We find that encoder models aware of relational information, such as DMR (discourse relations), RelBERT (inter-word relations), and GloVe-Verb (predicates), correlate better with the analogy score α. (3) Fine-tuning improves models' analogy identification ability. The fine-tuned models, RoBERTa-Reg and RoBERTa-CL, are the top-performing models and significantly outperform all the other baselines on α. (4) Generally, LLMs do not perform well on the analogy score α. As shown in Figure 4, most LLMs benefit from longer instructions, as the extra definitions help in understanding the scores. Moreover, we find that despite its size, FlanT5-xxl is one of the best-performing LLMs in terms of predicting EntSim and RelSim.

Multiple choice evaluation
We construct a multiple-choice evaluation set using the annotated story pairs. First, we gather story pairs with EntSim < 1.0 and RelSim > 2.0. For each target story, we choose 3 negative choices to form the candidates. Of these, two (easy) negative choices are randomly selected, while one (hard) negative example is chosen by retrieving stories with high nounal similarity (measured by the cosine similarity of the nounal GloVe embeddings) and < 50% token overlap. An example question is provided in Table 3. To assess human performance, we conduct human annotations.
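A simplified sketch of this hard-negative retrieval is given below, assuming spaCy for noun extraction and pretrained GloVe vectors via gensim; the exact tooling and preprocessing used for the dataset are not specified here.

```python
# Simplified sketch of hard-negative retrieval: stories with high nounal
# GloVe similarity to the target but < 50% token overlap. spaCy and gensim
# are assumptions of this sketch, not the authors' documented tooling.
import numpy as np
import spacy
import gensim.downloader as api

nlp = spacy.load("en_core_web_sm")
glove = api.load("glove-wiki-gigaword-300")

def noun_embedding(text):
    nouns = [t.text.lower() for t in nlp(text)
             if t.pos_ == "NOUN" and t.text.lower() in glove]
    return np.mean([glove[n] for n in nouns], axis=0) if nouns else np.zeros(300)

def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def hard_negative(target, candidates):
    t_vec = noun_embedding(target)
    def cos(v):
        return float(v @ t_vec / (np.linalg.norm(v) * np.linalg.norm(t_vec) + 1e-8))
    pool = [c for c in candidates if token_overlap(target, c) < 0.5]
    return max(pool, key=lambda c: cos(noun_embedding(c)))
```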
We assess LLMs on these multiple-choice questions. Each model input consists of an instruction, N examples of multiple-choice questions, and the query multiple-choice question. We evaluate the models using three different instructions, such as "Which candidate story is the best creative analogy for the source story?", where N can be 0, 1, or 3. As a baseline, we obtain the performance of the analogy retrieval model in (Sultan and Shahaf, 2023) on our multiple-choice questions, which achieves an accuracy of 44.9%.
Results. The results are presented in Table 4. Interestingly, while annotators answer the questions correctly with an accuracy of 85.7%, LLMs struggle to select the most analogous story (the average accuracy of text-davinci-003 is merely 37.1%).
Increasing the number of demonstrations does not show consistent benefits to model predictions. Also, we find that explicitly instructing models to choose the "creative analogy" (§ A.3, question template B) or providing a definition of SMT when explaining analogies (template C) yields better performance compared to simply asking models to select the best analogy (template A).
We present the breakdown of the types of choices selected (in percentage) in Table 5. We have the following observations: (1) LLMs can differentiate between randomly sampled easy negatives and the other choices. The proportion of easy negatives they select is less than 20%, whereas random chance would be 50%. Furthermore, more powerful LLMs like GPT-3.5 are better at this judgement compared to LLaMa and FlanT5-xxl.
(2) LLMs can be easily distracted by hard negatives, as they often have a similar or higher chance of selecting hard negative choices instead of the targets. This suggests that the models prioritize surface similarity over structural similarity, despite the latter being more important in identifying analogies. (3) In comparison, the baseline model from (Sultan and Shahaf, 2023) is more resilient against hard negative distraction. This is likely due to its framework design, which captures the structural similarity between stories by clustering entities and finding the mappings between clusters.

Story Analogy Generation
We examine whether the dataset STORYANALOGY can enhance models' ability to generate analogies. We evaluate FlanT5 (Chung et al., 2022), LLaMa (Touvron et al., 2023), ChatGPT (OpenAI, 2022), and GPT-3.5 in zero-shot and few-shot settings using 40 source stories from the test set. To explore the potential of smaller models in generating high-quality analogies, we also fine-tune FlanT5-xl (3B parameters) and FlanT5-xxl (11B parameters) using the same template.
A crowd annotation is conducted to evaluate the quality of the generated stories from the models mentioned above. Workers are provided with a source story and its corresponding generated target story. They are then asked to assess the following: (1) whether the target story is an analogy for the source (as opposed to being a literal similarity or something else); (2) whether the target story is novel compared to the source; and (3) whether the target is plausible (more details can be found in § A.4). The average scores from three annotators are reported in Table 6. Example generations are shown in Table 7.

Table 6: The crowd-annotated generation quality (%) in terms of (1) whether the target story is considered an analogy to the source; (2) novelty of the target story; (3) plausibility of the generations.
Under the zero-shot setting, we observe that FlanT5 and LLaMa struggle to generate meaningful analogies. They often tend to repeat patterns from the source stories (e.g., only replacing one word). In contrast, ChatGPT and GPT-3.5 produce more flexible stories that are frequently considered analogies and novel.
Stories in STORYANALOGY can help models generate better analogies. With a few demonstrations, we observe a significant improvement in the generation quality of LLaMa (+28.4% and +27.5%). Moderate improvements on ChatGPT and GPT-3.5 are also observed. Notably, fine-tuning smaller LMs enhances their generation quality. The fine-tuned FlanT5-xxl model performs better than zero-shot ChatGPT and is comparable to few-shot ChatGPT and GPT-3.5, despite having fewer parameters. Furthermore, while models become more creative through fine-tuning and in-context learning, their generation plausibility decreases, indicating an increase in hallucination.

Related Work
Word-level analogy.One of the famous works on word-level computational analogy was (Mikolov et al., 2013), where they found that word analogies can be predicted by word vector offsets.For instance,

ChatGPT
Just as a sled sliding down a steep hill gains momentum as it accelerates, so does a projectile as it falls under the force of gravity.
GPT-3.5An Olympic runner is running a middle distance race.

LLaMa-65B
A projectile is affected by gravity.It falls and picks up speed.the development of pretrained language models (PLMs) such as BERT (Devlin et al., 2018), there have been works utilizing PLMs to solve word analogies by LM perplexity (Ushio et al., 2021a), pretrain relational embedding on certain prompt templates (Ushio et al., 2021b), or use word analogies as latent restriction to implicitly probe relational knowledge (Rezaee and Camacho-Collados, 2022).

FlanT5
In this line of work, a typical evaluation setting is ranking word pairs based on their relational similarity with the source pair (Mikolov et al., 2013;Czinczoll et al., 2022).For instance, given a word pair A:B, the aim is to select a target pair C:D such that the relation between C and D is the most similar to A:B among all candidates.This is similar to our multiple-choice evaluation setting.
In comparison, only a handful of studies have examined sentence- or paragraph-level analogies.

Analogous text retrieval. Built on the famous structure mapping theory, SME (Falkenhainer et al., 1989) and LRME (Turney, 2008) model the analogy retrieval problem as an entity-mapping problem, which they then solve through web mining. Sultan and Shahaf (2023) develop a QA-SRL based analogy retrieval method to conduct entity mapping. However, these works evaluate their methods by annotating the precision of the top-ranked results, leaving no large-scale analogy evaluation benchmark to date.

Analogy generation. Recently, there have been attempts at pretraining or prompting LMs for analogy generation. Bhavya et al. (2022) and Webb et al. (2022) evaluated LLMs' ability to solve word analogy tasks, where they found that large language models such as GPT-3.5 can surpass human performance on certain word analogy tasks. Ding et al. (2023) evaluated LLMs' creativity in terms of cross-domain analogies. Bhavya et al. (2022) and Chen et al. (2022a) evaluated LMs' ability to generate explanations for word analogies. Bhavya et al. (2023) proposed a novel analogy mining framework based on generation.

Analogy benchmarks. There are many word-level analogy datasets. Google (Mikolov et al., 2013) and BATS (Gladkova et al., 2016) contain relatively easy syntactic or shallow semantic relations. In contrast, U2, U4, and Czinczoll et al. (2022) include examples with relatively more abstract relations. To the best of our knowledge, there is no large-scale story-level analogy dataset or resource as of the time of writing. The only related works here are Li and Zhao (2021) and Zhu and de Melo (2020), which transform word analogy pairs into sentence pairs with a few templates. Nagarajah et al. (2022) tried to annotate a small-scale story analogy benchmark based on fables, but did not achieve reliable annotator agreement. Wijesiriwardene et al. (2023) re-organized sentence relation datasets, viewing such relations (e.g., entailment, negation) as analogies, which is fundamentally different from our setting.

Analogy in other domains. In addition to analogies over word pairs and stories, there have been related studies on other topics. Hope et al. (2017) contribute a method for analogy mining over products. Chan et al. (2018) mine analogies from research papers with respect to their background, purpose, mechanism, and findings. Gilon et al. (2018) develop a search engine for expressing and abstracting specific design needs. Recently, Bitton et al. (2023) propose a visual analogies dataset, VASR, where they found that models struggle to identify analogies when given carefully chosen distractors.

Conclusion
We introduce STORYANALOGY, a multi-domain story-level analogy corpus with 24K story analogy pairs annotated with two similarities under the extended SMT. To assess the analogy identification and generation capabilities of various models, we devise a series of tests based on STORYANALOGY. The experimental findings indicate that current encoder models and LLMs still fall short of human performance in analogy identification. Additionally, we demonstrate that generative models can greatly benefit from our dataset.

Limitations
We attempted to ensure dataset coverage by utilizing seed data from various sources. However, there are still specific domains that we were unable to include, such as biomedical stories or academic articles. We can extend the annotation to these domains using the annotation framework and evaluation metrics presented in this paper. Additionally, we have explored applications such as analogy identification (Section 3) and generation (Section 4). The potential of STORYANALOGY for creativity generation tasks (such as poetry, lyrics, and humor generation) has not been fully investigated. Further development on other sources and applications is left as future work.

A Appendix
A.1 Details in creating STORYANALOGY.
We prompt the text-davinci-003 model to collect the story analogy candidates. The prompt templates are described below.

Demonstrations. The in-context learning seed examples are presented in Table 8. In addition to the golden story analogies, we also curated the corresponding keyword pairs for each story pair. These keyword pairs are useful for prompting the generation of candidate stories from the Word Analogy and ConceptNet inputs, whose input data format is word pairs (e.g., word: language :: note: music).
To construct a list of demonstrations for each data source, we ask experts to construct a set of analogous story pairs by web searching and revising the results. Then, to ensure the diversity of the analogies, we make a list of orthogonal topics in each dataset and randomly sample demonstrations from these subtopics every time we construct a prompt.
Prompt templates for "generating from story pairs". The template for each demonstration is: "Example:\n(1){source story(i)}\nAn analogy for story (1) can be:\n(2){target story(i)}". It is concatenated with a prompt at the end: "Example:\n(1){source story}\nAn analogy for story (1) can be:".

Prompt templates for "generating from word pairs". The prompt template for generating from word pairs begins with: "Write a group of 2-sentence ...".

The "long instruction" template for RelSim ends with: "... align very well between the two stories. Following the above instruction, evaluate the relational similarity for S1 and S2 (only answer by a score from 0, 1, 2, 3): {N-DEMONSTRATIONS HERE} Q: S1 - {INPUT-S1} S2 - {INPUT-S2} Score:". Here, we insert N ∈ {0, 1, 3} demonstrations at "{N-DEMONSTRATIONS HERE}" and fill in the story pairs at "{INPUT-S1}" and "{INPUT-S2}". The "short instruction" templates are similar, with the only difference that the detailed definitions of the scores are removed. For instance, "0: Unrelated. The two stories are talking about different topics and entities of different types." is replaced with "0: Unrelated."

To construct the multiple-choice evaluation set, we gather story analogy pairs with EntSim < 1.0 and RelSim > 2.0. Next, we sample negatives for each story analogy pair to form multiple-choice questions. Similar to the GloVe baseline in § A.2, we obtain the nounal embedding for each story and retrieve stories with high cosine similarities but < 50% overlapping tokens as the hard negative choices. We manually inspect the overall quality of the multiple-choice questions constructed in this manner. We excluded the questions generated from the ROCStories split due to their lower quality, likely because the unusual distribution of EntSim in this split made it difficult to use the same method for creating the dataset as in the other splits (Figure 3). The resulting multiple-choice dataset consists of 360 questions.
Baselines in (Sultan and Shahaf, 2023). We apply both the FMQ and FMV models, as suggested in (Sultan and Shahaf, 2023), to our story analogy identification task. To be precise, we gather the intermediate story pair similarities generated by their models and then choose the option that exhibits the highest similarity to the source story. Notably, the stories in our dataset are considerably shorter than those in the datasets used in their paper. Therefore, when running the baseline on our dataset, we adjusted the threshold of the similarity filter to better suit our setting, selecting a threshold of 0.3 for FMQ and 0.2 for FMV. For the other implementation details, we follow the original settings in their code repository at https://github.com/orensul/analogies_mining. We found that FMQ and FMV exhibit comparable performance (44.9% versus 44.7%) on the multiple-choice dataset; the result from FMQ is reported in the main paper.
A.4 Details in the evaluation of story analogy generation.
The models are evaluated under zero-shot, few-shot, and instruction-tuning settings. For zero-shot and few-shot prompting, the template is: "Write an analogy for story 1.\n \nStory 1: {}\nStory 2:". This template is also used in the fine-tuning setting. For fine-tuning, we employ DeepSpeed to accelerate training on a single 8*V100 (32GB) instance.
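Below is a minimal sketch of how a training example can be formatted with this template and used for a seq2seq fine-tuning step; a small FlanT5 variant is used for illustration, and the optimizer, hyperparameters, and DeepSpeed launch configuration are omitted.

```python
# Minimal sketch: format an example with the template above and compute a
# seq2seq fine-tuning loss. A small FlanT5 variant stands in for the
# FlanT5-xl/xxl models; optimizer and DeepSpeed setup are omitted.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def training_step(source_story, target_story):
    prompt = f"Write an analogy for story 1.\n \nStory 1: {source_story}\nStory 2:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(target_story, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()   # an optimizer step would follow in real training
    return loss.item()
```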
We conduct crowd annotation on AMT to evaluate the generation quality. The annotation instructions are presented in Figure 5. During the annotation, the meta information of the target generation is hidden from the annotators and the requesters. In addition, the target stories are shuffled so that annotators cannot tell which model generated each story based on the order.

A.5 Miscellaneous
In this section, we present some discussions that took place during the reviewing process.
A.5.1 Potential applications of this work.
Analogy Mining for Art and Design. There have been various studies focusing on building analogical search engines. Hope et al. (2017) contribute a method for analogy mining over products. Chan et al. (2018) mine analogies from research papers with respect to their background, purpose, mechanism, and findings. Gilon et al. (2018) develop a search engine for expressing and abstracting specific design needs. Recently, Bitton et al. (2023) propose a visual analogies dataset, VASR, where they found that models struggle to identify analogies when given carefully chosen distractors. In computer graphics, some graphic design algorithms take as input an image from the user and transform it into other types of visual designs that are similar to the given image, such as embroidery patterns (Zhenyuan et al., 2023) and vector line arts (Mo et al., 2021). This category of work establishes connections between images and application-specific graphics patterns. With images as guidance, complicated visual design processes are made easy and intuitive for non-professional users.

Analogical Reasoning. Large language models (LLMs) have demonstrated impressive abilities in few-shot and zero-shot learning (Kaplan et al., 2020; OpenAI, 2022, 2023). Recently, ChatGPT (OpenAI, 2022), GPT-4 (OpenAI, 2023), Alpaca (Taori et al., 2023), and their follow-up works (Chiang et al., 2023; Jiang et al., 2023) have achieved remarkable performance on a wide range of benchmarks. It is believed that they have acquired a certain kind of analogical reasoning ability that is not only task-specific (Webb et al., 2022; Ding et al., 2023) but also omnipresent throughout the prompting process of LLMs, and there is a lot of prompt engineering work leveraging this characteristic for downstream tasks (Jiang et al., 2022; Chan et al., 2023b,a,c; Chan and Chan, 2023). Meanwhile, it is important to note that LLMs also exhibit potential issues related to hallucination, biases, and privacy (Ray, 2023; Li et al., 2023a,b; Wang et al., 2023). Mitigating such issues often requires building up knowledge bases (Cheng et al., 2021; Cui et al., 2021b,a), where analogy could be a useful angle to improve automatic construction performance (Chen et al., 2022b). The data and evaluation metrics in this work may serve as a benchmark for evaluating one aspect of analogical reasoning ability.
A.5.2 Why the predictions of individual scores are good, but the prediction of α is bad.
Original question: How is it that models are so good at individually predicting EntSim and RelSim (in Section 3.1), but they are not that good at predicting the analogy score α? Since the analogy score is computed from both EntSim and RelSim, predicting the analogy score relies on predicting the gap between EntSim and RelSim, which is harder than predicting each similarity alone. A case is presented below to illustrate this.
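The following toy numbers (illustrative only, not drawn from the dataset or the paper's original example) show how small errors in each score can compound in the ratio:

```python
# Illustrative only: small per-score errors can change the ratio a lot.
gold_ent, gold_rel = 0.5, 2.5          # gold annotation: clear analogy, alpha = 5.0
pred_ent, pred_rel = 1.0, 2.0          # plausible model errors of +/- 0.5 each
print(gold_rel / gold_ent, pred_rel / pred_ent)   # 5.0 vs. 2.0
```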

Figure 1: An example of analogy between story S1, the invasion of cells by a virus, and S2, a burglar breaking into a house.

Figure 2: The similarity space, showing different kinds of matches in terms of the degree of relation similarity versus entity similarity. According to SMT, we can classify the type of match (Analogy, Literal similarity, Anomaly, or Mere-appearance) between the source and target story by the two similarities. The figure is an extension, with story examples, of the visualization in Gentner and Markman (1997).

Figure 3: Distributions of EntSim and RelSim on the four data domains in STORYANALOGY. Notably, the distributions of EntSim and RelSim on ROCStories tend to skew towards higher values. This could be attributed to the fact that stories from this source primarily revolve around human-focused social narratives.

Table 1: Examples in STORYANALOGY with annotations from each domain. We report the EntSim and RelSim from crowd workers. The Domain column indicates the source of the story pairs. "PP", "ROC", "WA", and "CN" are short for "ProPara", "ROCStories", "Word Analogy", and "ConceptNet", respectively.

Table 2: STS-style evaluation on different domains of STORYANALOGY. The values represent the Spearman's correlation (%) between the model predictions and the scores from the dataset (E, R, and α). Here, E, R, and α correspond to EntSim, RelSim, and the analogy score RelSim/EntSim, respectively. The LLM performance is evaluated under the "long instruction + 3-shot" setting.

Table 3: An example of the multiple choice question. The goal is to select the candidate story that is the best analogy for the source story.
Question: Which candidate story is the best creative analogy for the source story?
Source: Carbonic acid in rainwater breaks down rock. Plants grow in rock.
(0) Plants and animals grow and reproduce. The population size gets larger and larger.
(1) Recyclables are placed in a centralized container for the house. Recyclables are picked up by a recycling company.
(2) Salty ocean water erodes metal. Corals thrive on metal.
(3) The roots of the growing plants start to break up the rock. The plant acids dissolve the rock.
Answer: (2)
Table 7: Examples showing the source story and model generations under zero-shot, few-shot, and finetuning settings.
Source: A projectile is affected by gravity. It falls and picks up speed.
ChatGPT: Just as a sled sliding down a steep hill gains momentum as it accelerates, so does a projectile as it falls under the force of gravity.
GPT-3.5: An Olympic runner is running a middle distance race.
LLaMa-65B: A projectile is affected by gravity. It falls and picks up speed.
FlanT5-xl: A rocket is affected by gravity. It falls and picks up speed.

Table 8: Seed analogy examples used to generate candidate pairs for STORYANALOGY. We sample 5 pairs from ProPara and 5 pairs from ROCStories. Each story is followed by its keyword entities.

The stream becomes a river. The river continues to flow along the same path for a long time. (ENTITY: stream, river)
A person grows from a child into an adult. As time passes, the person experiences ongoing growth and maturation. (ENTITY: child, adult)

Magma rises from deep in the earth. The magma goes into volcanos. (ENTITY: magma, volcanos)
Food goes up from the stomach. The food enters the esophagus. (ENTITY: food, esophagus)

The plasma membrane encloses the animal cell. It controls the movement of materials into and out of the cell. (ENTITY: plasma membrane, cell)
Security guards monitor the doors of the factory. They manage the entry and exit of personnel to and from the factory. (ENTITY: security guard, factory)

The tadpole begins storing food in the tail. The tadpole develops hind legs and lives off food stored in its tail. (ENTITY: tadpole, food)
A person saves money in a savings account. The person relies on the saved funds to meet future financial obligations and sustain their lifestyle. (ENTITY: human, money)

The sediment near the bottom is compressed by the weight of newer sediment. The sediment becomes sedimentary rock as it is pushed together by the heavy weight. (ENTITY: sediment, sedimentary rock)
A person's ideas and beliefs are shaped by their experiences and influences. The person's thoughts and opinions become more solidified and defined as they are influenced by outside forces. (ENTITY: belief, solidified belief)

Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. (ENTITY: beach, walking)
Lenny liked to climb trees. He embarked on a tree-climbing expedition in the woods. (ENTITY: woods, climbing trees)

He got a call from his girlfriend, asking where he was. Frank suddenly realized he had a date that night. (ENTITY: call, date)
She received a notification on her phone, reminding her of an upcoming meeting. Jane suddenly remembered there was an important presentation to give. (ENTITY: notification, presentation)

She was petrified and prayed to get out of the test. On the last day of lessons, the bus broke down and she was spared. (ENTITY: test, fear)
He was terrified of the upcoming job interview. Due to oversleeping on the day of the interview, he missed the appointment and thus avoided the stress. (ENTITY: job interview, stress)

He is only two weeks into his job and he is nervous. Every time he responds to calls he gets very worried. (ENTITY: job, nervous)
Having recently started a relationship, she is grappling with anxiety. She becomes highly anxious whenever they have a disagreement. (ENTITY: relationship, anxious)

She made sure she was quiet and respected others' space. It was strange that on Wednesday, she came to the office hung over. (ENTITY: introverted, getting drunk)
James took care to comply with the rules and demonstrate deference towards authority figures. Surprisingly, he was caught shoplifting on a Friday. (ENTITY: disciplined, shoplifting)