FAME: Flexible, Scalable Analogy Mappings Engine

Analogy is one of the core capacities of human cognition; when faced with new situations, we often transfer prior experience from other domains. Most work on computational analogy relies heavily on complex, manually crafted input. In this work, we relax the input requirements, requiring only names of entities to be mapped. We automatically extract commonsense representations and use them to identify a mapping between the entities. Unlike previous works, our framework can handle partial analogies and suggest new entities to be added. Moreover, our method's output is easily interpretable, allowing for users to understand why a specific mapping was chosen. Experiments show that our model correctly maps 81.2% of classical 2x2 analogy problems (guess level=50%). On larger problems, it achieves 77.8% accuracy (mean guess level=13.1%). In another experiment, we show our algorithm outperforms human performance, and the automatic suggestions of new entities resemble those suggested by humans. We hope this work will advance computational analogy by paving the way to more flexible, realistic input requirements, with broader applicability.


Introduction
One of the pinnacles of human cognition is the ability to find parallels across distant domains and transfer ideas between them. This analogous reasoning process enables us to learn new information faster and solve problems based on prior experience (Minsky, 1988; Hofstadter and Sander, 2013; Holyoak, 1984; PJM, 1966).
The most seminal work in computational analogy is Gentner's Structure Mapping Theory (SMT) (Gentner, 1983) and its implementation, Structure Mapping Engine (SME) (Falkenhainer et al., 1989). In a nutshell, SMT assumes input from two domains: base and target. It maps between objects in a base domain and objects in a target domain according to common relational structure, rather than on object attributes.
For example, consider the Rutherford model of the hydrogen atom, where the atom was explained in terms of the (better-understood) solar system (Falkenhainer et al., 1989): a planet revolving around the sun is mapped to an electron revolving around the nucleus. The mapping is due to shared relations between objects (revolving around, being attracted to), not object attributes (round, small).
One of the main criticisms brought against SME and its follow-up work is their need for extensive hand-coded input: structured representations of both the entities and their relations (see Figure 1 for the input to the atom/solar system mapping). Chalmers et al. (1992) argued that too much human creativity is required to construct this input, and the analogy is already effectively given in the representations: "A brief examination [...] shows that the discovery of the similar structure in these representations is not a difficult task. The representations have been set up in such a way that the common structure is immediately apparent. Even for a computer program, the extraction of such common structure is relatively straightforward." Some follow-up works avoid hand-coding LISP-like representations, generating them from sketches (Forbus et al., 2011), qualitative simulators (Dehghani and Forbus, 2009), etc. However, they still require much knowledge engineering, and thus are hard to scale. Nowadays, when the web is full of information about potential domains to transfer ideas from (McNeil Jr and Odón, 2013), such representations do not tap into the potential of web-scale analogies for augmenting human creativity.
The method with the simplest input we are aware of is Latent Relation Mapping Engine (LRME) (Turney, 2008), which requires only two lists of entities to be mapped. Given two entities, they search for phrases containing both in a large corpus and use them to generate simple patterns. For example, "a sun-centered solar system illustrates" gives rise to patterns such as "a X * Y illustrates". However, such patterns are extremely simple and brittle, and LRME requires exact string matches between the domains (so "revolve around" is different from "rotate around").
Figure 1: SME representation of the Solar system/Rutherford atom. Reproduced from Falkenhainer et al. (1989).
In this work, we develop FAME, a Flexible Analogy Mapping Engine. FAME's input requirements are minimal, requiring only two sets of entities. We apply state-of-the-art NLP and IR techniques to automatically infer commonsense relations between the entities using a variety of data sources, and construct a mapping between the domains. Importantly, we do not require identical phrasings of relations. Moreover, our output is interpretable, showing how the mapping was chosen.
Unlike previous works, we drop the strong bijectivity assumption and let the algorithm decide which entities to include in the mapping; that is, we allow entities to remain unmapped. Our algorithm can also generate new suggestions for the non-mapped entities. This paves the road to algorithms that can handle even more limited input, for example, using domain names (solar system, atom) as input, or just a single mapped entity pair (e.g., turn white blood cells into policemen and see how the analogy unfolds).
Our contributions are:
• A novel, scalable, and interpretable approach for automatically mapping two domains based on commonsense relational similarities. Our algorithm handles partial mappings and suggests additional entities.
• We extend the work of Romero and Razniewski (2020) to discover salient knowledge about pairs of entities.
• Our model's accuracy is 81.2% on simple 2x2 problems (guess level=50%). On larger problems, it achieves 77.8% perfect mappings (guess level=13.1%). In another experiment, we outperform humans (90% vs. 70.2%) and demonstrate that our automatic suggestions resemble human suggestions.
We release code and data.

Problem Definition
An analogy is a mapping from a base domain B into a target domain T. The mapping is based on relations, and not object attributes. Base objects are not mapped into objects that resemble them; rather, there is a common relational structure, and they are mapped to objects that play similar roles. We follow the formulation of Sultan and Shahaf (2022), brought here for completeness:
Entities and Relations. Let $B = \{b_1, ..., b_n\}$ and $T = \{t_1, ..., t_m\}$ be two sets of entities. For example: B = {sun, Earth, gravity, solar system, Newton}, T = {nucleus, electrons, electricity, atom, Faraday}. Let R be a set of relations. A relation is a set of ordered entity pairs with some meaning. The exact representation is purposely vague, as we do not restrict ourselves to strings, embeddings, etc. Intuitively, relations should capture notions like "revolve around".
In our example, relations between B and T include: the Earth revolves around the Sun, like electrons orbit the nucleus; the Earth creates a force field of gravity, similar to electrons creating electric force fields; the Sun and the Earth are part of the solar system, as the nucleus and electrons are part of the atom; Newton discovered gravity, as Faraday is credited with discovering electric force.
Table 1: Illustration of a relational analogy between the solar system and the atom.
Note that a relation is asymmetric, as the pairs are ordered; e.g., Newton discovered gravity, but gravity did not discover Newton.
Slightly abusing notation, we denote the set of relations that hold between two entities $e_1, e_2$ as $R(e_1, e_2) \subseteq 2^{R}$. For example, R(Earth, Sun) contains {revolve around, attracted to}, etc. For clarity, we sometimes use $R_B$, $R_T$ to emphasize that the entities belong to the B, T domain.
Similarity. Let sim be a similarity metric between two sets of relations, $sim : 2^{R} \times 2^{R} \to [0, \infty)$. Intuitively, when applied to singletons, we want our similarity metric to capture how much the relations are like each other. For example, "revolve around" is similar to "orbit" and (to a lesser degree) "spiral". When applied to sets of relations, we want sim to be higher if the two sets share many distinct relations. For example, {revolve around, attracted to} should be more similar to {orbit, drawn into} than to {revolve around, orbit} (as the last set does not include any relation similar to attraction). In Section 3.2 we present our sim implementation.
Given one pair from B and one from T, we define similarity in terms of their relations. Since R is asymmetric, we consider both directions:
$$sim^*(b_1, b_2, t_1, t_2) = sim(R(b_1, b_2), R(t_1, t_2)) + sim(R(b_2, b_1), R(t_2, t_1))$$
Objective. Our goal is to output a mapping $M : B \to T \cup \{\bot\}$ such that no two B entities are mapped to the same T entity (Table 1). Mapping into ⊥ means the entity was not mapped to any entity in the T domain.
We look for the mapping $M^*$ that captures the best inter-domain analogical structure similarity by maximizing the relational similarity:
$$M^* = \arg\max_{M} \sum_{j=1}^{n-1} \sum_{i=j+1}^{n} sim^*(b_j, b_i, M(b_j), M(b_i))$$
Note: if $b_i$ or $b_j$ maps to ⊥, $sim^*$ is defined to be 0.
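As a minimal sketch of how a candidate mapping could be scored, assume the objective adds up sim* over every unordered pair of base entities and that any pair involving an unmapped entity (⊥, written `None` here) contributes 0. The names `mapping_score` and `toy_sim_star`, and the toy similarity table, are illustrative assumptions rather than part of FAME's code.

```python
from itertools import combinations

# Toy directional similarities keyed by (b1, b2, t1, t2); hypothetical values.
TOY_SIM = {
    ("Earth", "sun", "electron", "nucleus"): 1.0,  # e.g. "revolve around" ~ "orbit"
    ("sun", "Earth", "nucleus", "electron"): 0.5,  # e.g. "attracts" ~ "pulls"
}

def toy_sim_star(b1, b2, t1, t2):
    """Stand-in for sim*: add the similarity of both relation directions."""
    return TOY_SIM.get((b1, b2, t1, t2), 0.0) + TOY_SIM.get((b2, b1, t2, t1), 0.0)

def mapping_score(mapping, sim_star):
    """Sum sim* over all unordered pairs of base entities in `mapping`.

    mapping: dict from base entity to target entity, or None if unmapped.
    sim_star: callable (b1, b2, t1, t2) -> float (assumed interface).
    """
    total = 0.0
    for b1, b2 in combinations(sorted(mapping), 2):
        t1, t2 = mapping[b1], mapping[b2]
        if t1 is None or t2 is None:
            continue  # pairs involving an unmapped entity contribute 0
        total += sim_star(b1, b2, t1, t2)
    return total

good = {"sun": "nucleus", "Earth": "electron", "Newton": None}
bad = {"sun": "electron", "Earth": "nucleus", "Newton": None}
print(mapping_score(good, toy_sim_star), mapping_score(bad, toy_sim_star))
```

With this toy table, the relation-preserving mapping scores higher than the swapped one, and leaving Newton unmapped neither helps nor hurts the score.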

Analogous Matching Algorithm
We wish to find the best mapping from B to T. We first extract relations between entity pairs from the same domain (Section 3.1). Then, we compute similarity between entity pairs that could be mapped (Section 3.2). Finally, we build the mapping (Section 3.3).

Relation Extraction
Automatically extracting relations is a key part of our algorithm, as it eliminates the need for extensive manual curation of the input. We focus on commonsense relations (e.g., the Earth revolves around the Sun), as opposed to situational relations (e.g., the book is on the table). This broadly falls under open information extraction (OIE), the task of generating a structured representation of the information in a text. There has been a lot of work in this area, especially attempts to automate the construction of commonsense datasets (Etzioni et al., 2008, 2004; Yates et al., 2007; Lenat et al., 1985; Sap et al., 2019).
Given two entities, we automatically extract relations from multiple sources:
ConceptNet. A commonsense dataset, containing about 1.5M nodes (Liu and Singh, 2004). For each entity, we receive a list of (predicate, entity) pairs, which we filter to match the second entity (singular or plural form). The predicates serve as our relations.
Open Information Extraction. A database automatically extracted from a large web corpus (Etzioni et al., 2008). It contains over 5B triplets of the form (subject, predicate, object). We searched for a match between both entities in the (subject, object) fields, and used the predicates as our relations.
GPT-3 (text-davinci-001). We used a generative pretrained large language model (LM) as a knowledge base in a few-shot manner (Petroni et al., 2019; Brown et al., 2020b). We input a prompt of four analogies, e.g., "Q: What are the relations between gravity and Newton?, A: Newton discovered gravity. A: Newton invented gravity." (see Section A.2.3 for the full prompt). GPT-3 outputs up to three sentences per query. We kept only sentences of the form <entity> <text> <entity>, treating the <text> as the relation.
Quasimodo. A commonsense knowledge base that focuses on salient properties of objects (Romero and Razniewski, 2020). It contains more than 3.5M triplets of (subject, predicate, object). It considers questions instead of statements. For instance, if people search for an answer to "Why is the sky blue?", this implies that the sky is blue. Whenever our two entities appeared in the (subject, object) fields, we extracted their predicates as relations.
Quasimodo++. A relation extraction method that we develop, inspired by Quasimodo. Quasimodo was constructed using questions about a single entity; we extended it to questions exploring relations between pairs of entities. We used Google's query auto-completion to tap into the query logs, asking questions containing both desired entities, such as "How does Earth * Sun", "How is Earth * Sun", and "Why does Sun * Earth" for every pair of entities (see Figure 2 for an example). The exact regular expressions we used can be found in Section A.1.
We presented here the knowledge sources we implemented.We note that our algorithm is easy to extend to new sources and that we expect that its robustness will increase with coverage.
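Two of the extraction steps above lend themselves to a short illustration: building the wildcard questions used by Quasimodo++, and keeping only sentences of the form <entity> <text> <entity>. The sketch below is a hedged approximation; the function names, the optional handling of a leading "the", and the regular expression itself are assumptions made for illustration, not the expressions listed in Section A.1.

```python
import re
from itertools import permutations

# Question prefixes quoted in the text; "*" is the wildcard left for auto-completion.
QUESTION_PREFIXES = ("How does", "How is", "Why does")

def build_queries(entities):
    """One wildcard query per question prefix and ordered entity pair."""
    return [f"{prefix} {a} * {b}"
            for a, b in permutations(entities, 2)
            for prefix in QUESTION_PREFIXES]

def extract_relation(sentence, e1, e2):
    """Return (subject, relation, object) if the sentence reads
    '<entity> <text> <entity>' in either order, else None."""
    s = sentence.strip().rstrip(".")
    for subj, obj in ((e1, e2), (e2, e1)):
        # Assumption: tolerate an optional article before each entity.
        pattern = (rf"^(?:the\s+)?{re.escape(subj)}\s+(.+?)\s+"
                   rf"(?:the\s+)?{re.escape(obj)}$")
        match = re.match(pattern, s, flags=re.IGNORECASE)
        if match:
            return subj, match.group(1), obj
    return None

queries = build_queries(["Earth", "Sun"])
print(queries[0])
print(extract_relation("Newton discovered gravity.", "gravity", "Newton"))
print(extract_relation("The Earth revolves around the Sun.", "Earth", "Sun"))
```

Sentences that do not start and end with the two entities are discarded, which mirrors the filtering described for GPT-3 outputs.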

Scoring Entity Pairs
We wish to calculate $sim^*(b_1, b_2, t_1, t_2)$. In Section 2 we specified desiderata of sim, especially that it is higher if the two sets share many distinct relations. We now present our implementation of sim.
Without loss of generality, let us consider one direction, $sim(R_B(b_1, b_2), R_T(t_1, t_2))$, which compares each relation in $R_B(b_1, b_2)$ with each relation in $R_T(t_1, t_2)$. We create a complete bipartite graph where the left side nodes are the relations of $R_B(b_1, b_2)$, and the right side nodes are the relations of $R_T(t_1, t_2)$ (Figure 3). The edge weights (w) are the cosine similarity of the nodes' sBERT embeddings (Reimers and Gurevych, 2019).
We remove non-informative relations by extracting the top-frequent n-grams (n ∈ {1, 2, 3, 4}) from Wikipedia and setting their score to zero. Edges that did not reach a threshold (chosen using hyper-parameter search, see Section 3.3) were set to zero.
Next, we cluster similar relations on each side (e.g., "revolve around" and "circle around") to avoid double-counting. We use hierarchical agglomerative clustering based on the cosine embedding similarity (threshold = 0.5; see Section 3.3). The weight of edges between two clusters is the maximal weight of an edge between their nodes (see Figure 3; colors correspond to clusters).
Finally, we apply Maximum-Weight Bipartite Matching on the clusters (see Section 3.3). The similarity score $sim(R_B(b_1, b_2), R_T(t_1, t_2))$ is defined as the sum of the remaining edges.
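The pipeline above (edge weights, thresholding, clustering, matching) can be sketched on toy data as follows. This is only an approximation under stated assumptions: a hand-written lookup `toy_sim` replaces sBERT cosine similarity, single-linkage merging stands in for the hierarchical agglomerative clustering (the linkage is not specified in the text), the edge threshold of 0.2 is a made-up value, the n-gram filter is omitted, and the matching is brute-forced, which is only viable for a handful of clusters.

```python
from itertools import permutations

def single_link_clusters(items, sim, threshold):
    """Merge clusters while any cross-cluster pair is at least `threshold` similar."""
    clusters = [[x] for x in items]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(sim(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

def max_weight_matching(weights):
    """Brute-force maximum-weight bipartite matching on a small weight matrix."""
    if not weights or not weights[0]:
        return 0.0
    rows, cols = len(weights), len(weights[0])
    if rows > cols:  # transpose so that rows <= cols
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    return max(sum(weights[i][j] for i, j in enumerate(perm))
               for perm in permutations(range(cols), rows))

def set_similarity(rels_b, rels_t, sim, edge_threshold=0.2, cluster_threshold=0.5):
    """Cluster each side, weight cluster pairs by their best edge, then match."""
    cb = single_link_clusters(rels_b, sim, cluster_threshold)
    ct = single_link_clusters(rels_t, sim, cluster_threshold)

    def weight(x, y):
        best = max(sim(a, b) for a in x for b in y)
        return best if best >= edge_threshold else 0.0  # weak edges are zeroed

    return max_weight_matching([[weight(x, y) for y in ct] for x in cb])

# Hypothetical similarities standing in for embedding cosine similarity.
TOY = {
    frozenset({"revolve around", "orbit"}): 0.875,
    frozenset({"revolve around", "circle around"}): 0.75,
    frozenset({"circle around", "orbit"}): 0.75,
    frozenset({"attracted to", "drawn into"}): 0.625,
}

def toy_sim(a, b):
    return 1.0 if a == b else TOY.get(frozenset({a, b}), 0.0)

base = ["revolve around", "attracted to"]
print(set_similarity(base, ["orbit", "drawn into"], toy_sim))
print(set_similarity(base, ["revolve around", "orbit"], toy_sim))
```

On this toy input, the set covering two distinct relations scores higher than the set containing two near-synonyms, which is the behaviour the desiderata for sim ask for.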

Building a Mapping
Using the scores of mappings between pairs, we can compose larger mappings. We use beam search, starting from the most promising pair-mappings found in Section 3.2. In each iteration, we expand the 20 most promising partial mappings, testing each possible mapping between single entities of B and T that is consistent with the current partial mapping (i.e., a B entity cannot map to multiple T entities). When expansions stop increasing the score, we stop the search and select the mapping with the highest score.
Figure 4: A snippet from our UI. Top: Input. Bottom: The mapping our algorithm found (output), represented as a graph. Nodes correspond to mappings between single entities (e.g., sun to nucleus). Each edge is annotated with some of the shared relations between the mapped pairs corresponding to its endpoints and their similarity score. For the sake of visualization, we show at most two relations for each edge. Edge weight corresponds to strength.
Figure 4 shows a snippet from our UI. Input appears on the top. FAME's output mapping is represented as a graph: nodes correspond to single entity mappings (e.g., Sun to nucleus). Edges represent the shared relational structure. Each edge contains some of the shared relations between the mapped pairs corresponding to its endpoints (e.g., "more massive than") and their similarity score (note the edges are directional). To ease visualization, we show at most two relations per edge. The thickness of an edge corresponds to its weight.
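The beam search described above can be sketched as follows. This is a simplified illustration, not FAME's implementation: `toy_sim_star` compares hand-written relation labels by exact match (FAME does not require identical phrasings), the mapping score is assumed to be the sum of pairwise similarities, and all function names are hypothetical.

```python
from itertools import combinations, permutations

def score(mapping, sim_star):
    """Sum sim* over all pairs of single-entity mappings (b, t) in a partial mapping."""
    pairs = sorted(mapping)
    return sum(sim_star(b1, b2, t1, t2)
               for (b1, t1), (b2, t2) in combinations(pairs, 2))

def top_k(mappings, sim_star, k):
    """Keep the k best mappings; ties are broken by content for determinism."""
    return sorted(mappings, key=lambda m: (-score(m, sim_star), sorted(m)))[:k]

def beam_search(base, target, sim_star, beam_width=20):
    # Seed the beam with every way of mapping a base pair onto a target pair.
    seeds = {frozenset({(b1, t1), (b2, t2)})
             for b1, b2 in combinations(base, 2)
             for t1, t2 in permutations(target, 2)}
    beam = top_k(seeds, sim_star, beam_width)
    if not beam:
        return {}
    best = beam[0]
    while True:
        expansions = set()
        for m in beam:
            used_b = {b for b, _ in m}
            used_t = {t for _, t in m}
            # Add one consistent single-entity mapping (b -> t).
            expansions.update(m | {(b, t)}
                              for b in base if b not in used_b
                              for t in target if t not in used_t)
        beam = top_k(expansions, sim_star, beam_width)
        # Stop when no expansion exists or the score no longer increases.
        if not beam or score(beam[0], sim_star) <= score(best, sim_star):
            return dict(best)
        best = beam[0]

# Toy directed relations (hypothetical labels).
RELS = {
    ("Earth", "sun"): "revolves around", ("electron", "nucleus"): "revolves around",
    ("sun", "Earth"): "attracts", ("nucleus", "electron"): "attracts",
    ("Earth", "gravity"): "creates", ("electron", "electricity"): "creates",
}

def toy_sim_star(b1, b2, t1, t2):
    def one_way(x, y, u, v):
        r = RELS.get((x, y))
        return 1.0 if r is not None and r == RELS.get((u, v)) else 0.0
    return one_way(b1, b2, t1, t2) + one_way(b2, b1, t2, t1)

result = beam_search(["sun", "Earth", "gravity", "Newton"],
                     ["nucleus", "electron", "electricity"], toy_sim_star)
print(result)
```

Because the base domain here has one more entity than the target, Newton is simply absent from the returned mapping, illustrating how a partial mapping arises without any special casing.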
A note on the solution space. In previous works, n = m and M is a bijective function, meaning that M is both injective (one-to-one; each element in the target is the image of at most one element in the source) and surjective (onto; all the target terms are covered). In other words, no entity is left unmapped. In that case, the solution space's cardinality is n!.
We allow for n ≠ m and for entities to remain unmapped. Without loss of generality let n ≤ m. The cardinality is then $\sum_{i=0}^{n} \binom{n}{i} \binom{m}{i} \, i! - n \cdot m$, where i is the number of matched entities. We subtract $n \cdot m$ because we do not allow for a mapping of size 1; our algorithm starts by mapping pairs and then adds a single-entity mapping at each iteration of the beam search.
This relaxation of the bijective constraint drastically increases the space; for n = 7, n! = 5,040, while our space is of size 130,922.
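To make the size of the relaxed space concrete, the count of partial one-to-one mappings can be computed directly: choose i base entities, choose i target entities, and pair them in i! ways. The sketch below is a worked illustration; `count_partial_mappings` and its `min_size` parameter are names introduced here, with `min_size` letting the reader exclude the empty mapping or the n·m mappings of size 1.

```python
from math import comb, factorial

def count_partial_mappings(n, m, min_size=0):
    """Number of injective partial mappings from n base entities to m target
    entities that map at least `min_size` entities."""
    return sum(comb(n, i) * comb(m, i) * factorial(i)
               for i in range(min_size, min(n, m) + 1))

print(factorial(7))                  # bijective case for n = m = 7
print(count_partial_mappings(7, 7))  # partial mappings of every size, 0 through 7
# Mappings of size exactly 1: there are n * m of them.
print(count_partial_mappings(7, 7, 1) - count_partial_mappings(7, 7, 2))
```

For n = m = 7 the first two lines print 5040 and 130922, the two figures quoted above, and the last line prints 49.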
Hyper-Parameter Search. We constructed a new dataset to set our model's hyper-parameters (see Appendix A). The dataset contains 36 analogical mapping problems created by ten volunteers, not from our research team. We showed them example analogies and asked them to generate new ones. An expert from our team verified their output, discarding 4 analogies. Domain size varied between 3 and 5 (average size=3.4).
On the problems generated by the volunteers, FAME achieves 83.3% perfect mappings (the whole mapping is correct).If we consider single mappings separately, it achieves 89.4% accuracy.

Entity Suggestion
One of the main limitations of previous analogical mapping algorithms is their inability to automatically expand analogies.This is especially interesting in our case, as we allow for unmapped entities; thus, suggesting new entities could identify potential mapping candidates for the unmapped entities.
For example, let B = {Sun, Earth, gravity, Newton} and T = {nucleus, electron, electricity}. The correct mapping is Sun → nucleus, Earth → electron, gravity → electricity, leaving Newton with no mapping. Our goal is to suggest candidate entities that preserve the relational structure.
Intuitively, we look at the relations Newton shares with other B entities (e.g., discovered gravity), and try to see which T entity plays a corresponding role (i.e., who discovered electricity?).
More formally, suppose we wish to find candidates $t^*$ for mapping to $b_n$. We first extract the relations that $b_n$ shares with each mapped entity $b_i$, denoted $R_{b_i}$. We then iterate over all relations $r \in R_{b_i}$ and use the pair $\{M(b_i), r\}$ to extract suggestions for $t^*$. We use Open Information Extraction, Quasimodo, and Quasimodo++. While our method was previously used to extract relations given a pair of two entities, we now use it to extract entities given a pair of {entity, relation}. This entails filtering on the predicate field in our commonsense datasets and changing the queries in Quasimodo++.
Table 2: Ablation study on the 2x2 near and far problems and our extended set, leaving out knowledge sources. Results show the importance of the generative LM approach (GPT-3) as a knowledge source. Open Information Extraction also contributes much, especially for the complex analogies (2x2-far and extended).
As suggestions tend to be noisy, we cluster all extracted entities (similarly to the clustering from Section 3.2).We remove clusters of size < 2.
For each suggestion cluster, we rerun our analogous matching algorithm with a representative entity from that cluster (the closest to the cluster's center of mass).We pick the cluster whose representative resulted in the mapping with the highest score.As the commonsense datasets we work with operate mostly on string matching, small changes (e.g., Benjamin Franklin/Ben Franklin) could sometimes result in slightly different results.Thus, we perform one final round, with all entities from our chosen cluster, and pick the highest score mapping.
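The candidate-retrieval step can be illustrated with a toy triple store. The sketch below is heavily simplified and rests on assumptions: it matches predicates by exact string, and it ranks candidates by how often they are retrieved, which is a stand-in for the clustering and re-mapping steps described above rather than a reproduction of them. The triples, the function name `suggest_entities`, and its arguments are hypothetical.

```python
from collections import Counter

def suggest_entities(kb, b_new, base_triples, mapping):
    """Collect target-side candidates that play b_new's role.

    kb: iterable of (subject, predicate, object) triples (toy knowledge base).
    base_triples: triples in the base domain that involve b_new.
    mapping: dict from already-mapped base entities to target entities.
    """
    votes = Counter()
    for subj, pred, obj in base_triples:
        if subj == b_new and obj in mapping:
            # (b_new, pred, b_i) -> look for (?, pred, M(b_i))
            votes.update(s for s, p, o in kb if p == pred and o == mapping[obj])
        elif obj == b_new and subj in mapping:
            # (b_i, pred, b_new) -> look for (M(b_i), pred, ?)
            votes.update(o for s, p, o in kb if p == pred and s == mapping[subj])
    return votes.most_common()

# Toy data for illustration only.
KB = [
    ("Faraday", "discovered", "electricity"),
    ("Franklin", "discovered", "electricity"),
    ("Faraday", "studied", "electricity"),
    ("Volta", "invented", "battery"),
]
BASE_TRIPLES = [("Newton", "discovered", "gravity"), ("Newton", "studied", "gravity")]
MAPPING = {"Sun": "nucleus", "Earth": "electron", "gravity": "electricity"}

suggestions = suggest_entities(KB, "Newton", BASE_TRIPLES, MAPPING)
print(suggestions)
```

In the full method, each surviving candidate would then be added to T and the mapping algorithm rerun, keeping the candidate that yields the highest-scoring mapping.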

Evaluation
In this section, we evaluate FAME. We test its ability to identify the correct mapping (Section 5.1), and compare it to both related work (Section 5.2) and human performance (Section 5.3).

Performance on Analogy Problems
2x2 problems. One of the things that might have held computational analogy back is the lack of high-quality, large-scale datasets. Most datasets are small and focus on classical 2x2 problems (A : B :: C : D), similar to SAT questions.
We start by testing FAME on this standard type of analogy. We use 80 problems from Green et al. (2010), split into 40 near and 40 far analogies (e.g., for "answer:riddle", the near analogy is "solution:problem" and the far analogy is "key:lock"). While the dataset is small, we believe it is still interesting to explore. Our algorithm managed to perfectly map 85% of near analogies and 77.5% of far ones. Random guess baseline is 50% (see Section 3.3).
Extended problems. Encouraged by the results of the 2x2 problems, we explore more complex problems. We decided to extend the Green far analogies (which are harder than the near ones). We had three experts go over the dataset together and brainstorm potential extensions. On four problems, the experts did not manage to agree on any additional mappings, leaving us with 36 extended problems (average domain size 3.3).
Our algorithm perfectly mapped 77.8% of the extended problems. Random baseline is 13.1% on average. As we relax the bijection assumption, FAME's average guess level is even lower: 2.2% (see Section 3.3). If we look beyond the top-rated solution, our algorithm has the correct solution in its top-2 guesses 83.3% of the time and 91.7% for top-3.
Error analysis. We found 3 main causes of error:
• Coverage (for example, we could not find a relation between "hoof" and "hoofprint"). This prompted us to ablate the knowledge sources FAME uses (Table 2). Results show the importance of the generative LM approach. Open IE is also important, especially for the more complex analogies (far and extended). Some sources, such as ConceptNet, did not seem to contribute much.
• Noisy relations that are either peculiar or plain wrong (e.g., "a footballer can iron").
• Embedding similarity (for example, "produce" and "is produced by" have a high similarity score). This is exacerbated by ambiguity (e.g., the word "pen" referred to "pigpen" and not to the writing instrument).

Comparison to Related Work
SME line of work. We had difficulty comparing FAME to SME (Falkenhainer et al., 1989) and its extensions, due to their complex input requirements. LRME (Turney, 2008) is closest to our setting, but no code or demo is available. Thus, we compare to their published results on a set of 20 problems.
LRME's entities include nouns, verbs, and adjectives. Since FAME expects noun phrases, we filtered out all other input terms (one problem has only a single noun, so we are left with 19 problems). It is hard to compare in this setup (and unfortunately, the authors did not report which partial mappings were correct). Still, LRME's accuracy was 75%, whereas FAME achieved 84.2%.
While the size of the problems is smaller when restricted to nouns, we believe the noun-only setting is harder. The verbs and adjectives often provide hints that significantly constrain the search space. For example, in problem A6 (Turney, 2008) (mapping a projectile to a planet) there is one adjective in each domain (parabolic, elliptical). Those adjectives can only apply to one or two of the nouns (i.e., you cannot have parabolic earth, air, or gravity), effectively giving away the noun mapping.
As a side note, we also believe that our noun-only input is a cleaner problem setting, as it is often easier to automatically identify the entities in a domain than to identify the attributes and verbs relevant for the analogy. In the words of LRME's authors, "LRME is not immune to the criticism of Chalmers et al. (1992), that the human who generates the input is doing more work than the computer that makes the mapping." We believe FAME is a step in the right direction in this regard.
Pretrained LMs. In the absence of a baseline, we turn to a generative pretrained large LM known to have impressive commonsense abilities: GPT-3.5 (text-davinci-002). We used 4 random examples from the hyper-parameter search dataset. After some experimentation with prompt engineering, we chose two variants (see A.2.3).
The results are summarized in Table 3. GPT-3.5 does well on the 2x2 datasets (Green et al., 2010). However, both datasets appear on the web, and perhaps GPT-3.5 was exposed to them during training (data leakage). In particular, we found some of the answers via a simple web search (Figure A.6).
Moreover, GPT-3.5's performance drops on the extended set, where problems are complex and do not appear on the web. Interestingly, it does not even manage to return a valid mapping in some of the cases. This exercise improves our understanding of FAME's strengths and weaknesses.
E-KAR dataset. Chen et al. (2022) recently released a relevant dataset, E-KAR, for rationalizing analogical reasoning. The dataset consists of multiple-choice problems from civil service exams in China. For example, for the source triplet "tea:teapot:teacup", the correct answer is "talents:school:enterprise". The reasoning is that both teapot and teacup are containers for tea. After the tea is brewed in the teapot, it is transported into the teacup. Similarly, both school and enterprise are organizations. After talents are educated in school, they are transported into enterprise. The E-KAR test set has no labels, so we used their validation set (N=119) to test FAME. As our task is different, we only took source entities (as B) and entities from the correct answer (as T). We filtered out questions without nouns, resulting in N=101.
FAME found the right mapping 68.3% of the time. A closer examination of FAME's mistakes revealed that ~75% of them occurred due to relation types that are not at all covered by our framework: either ternary relations (soldier:doctor:military doctor → car:electric vehicle:electric car; the last term is a combination of the first two) or relations based on sharing some attribute (so "both containers for holding tea" is mapped to "both are organizations"). Some of the attribute-based mappings work at the whole-set level, so each entity in B could map to each entity in T (yellow:red:white → sad:happy:angry). Thus, we conclude there is a big gap between FAME and E-KAR's assumptions.

Comparison to People
We compare FAME with human thinking in a two-phase experiment. In the closed-world phase, the participants received ten structure mapping problems, in which they were asked to match instances from B to T. The domains included between 3 and 5 entities (Table A.4). Participants were instructed to map each B entity into exactly one T entity.
In the open-world phase, participants received five mapped problems, but one entity was left blank (Table A.5). Participants were instructed to fill in the blank with an entity that preserves the analogy.
Participants. We recruited 304 participants using social media. The compensation was a chance to win one of three $30 vouchers. 76.6% of our participants were between the ages of 18-35 and 17.2% were between 36-45 (self-reported).
Closed-world mapping. FAME misclassified one problem compared to the gold standard (A9, Table A.4), achieving 90% accuracy (human baseline was 70.2%; see full distribution in Table A.4).
Problem A6 has the lowest human accuracy (35.5%), and is also the largest one (|B| = |T| = 5). A closer examination of its confusion matrix reveals that while FAME correctly mapped water to heat and pressure to temperature, 15% of people switched the two. This might be due to the strong semantic pairing of water and temperature. FAME is immune to this, as it relies on relations.
On average, each participant's mapping of a problem agreed with FAME's 78% of the time. Overall, FAME outperforms humans, and most of the disagreement is due to human errors.
Open-world entity suggestion. We presented participants with five mapped problems where one entity was left blank (Table A.5) and asked them to fill in the blank while preserving the analogy.
For all five problems, an entity from FAME's top two completions appeared in humans' top three completions (Table A.6), meaning that our algorithm's top suggestions are similar to humans'. Only in one example (B5) did one of the algorithm's top two completions appear third among humans' completions (in the rest it is first or second). We suspect that this confusion in B5 occurred because gravity and Newton reminded participants of the term apple.
Figure 5 shows a word cloud for answers to problem B1.While most responses are quite similar, some participants returned creative and appropriate solutions (e.g., treasure chest, jewelry box, car).
Symbolic approaches usually represent input as structured sets of logic statements. Our work falls under this branch, as well as SME (Falkenhainer et al., 1989) and its follow-up work. LRME (Turney, 2008) is the closest to our work, as it automatically extracts the relations. Unlike FAME, LRME requires exact matches of relations across different domains. We also focus on nouns only, making the problem harder, and relax the bijection assumption, allowing for automatically extending analogies.
NLP. Analogy-making received relatively little attention in NLP. The best-known task is word analogies, often used to measure embeddings' quality (inspired by Word2Vec's "king - man + woman = queen" example (Mikolov et al., 2013)). Follow-up work explored embeddings' linear algebraic structure (Arora et al., 2016; Gittens et al., 2017; Allen and Hospedales, 2019) or compositional nature (Chiang et al., 2020), neglecting relational similarity. A recent work on analogies between procedural texts (Sultan and Shahaf, 2022) did study relational similarity, but extracted the relations from the input texts, with no commonsense augmentations.
Recently, there have been efforts to study LMs' analogical capabilities (Ushio et al., 2021;Brown et al., 2020a).Findings indicate they struggle with abstract and complex relations and results depend strongly on LM's architecture and parameters.Kittur et al. (2019) combined NLP and crowds for product analogies without explicitly modeling entities and relations, but instead automatically extracting schemas of the product.

Conclusions and Future Work
Detecting deep structural similarity across distant domains and transferring ideas between them is central to human thinking. We presented FAME, a novel method for analogy making. Compared to previous works, FAME is more expressive, scalable, robust and interpretable. It also allows partial matches and automatic entity suggestions to extend the analogies.
FAME correctly maps 81.2% of classical 2x2 analogy problems.
On larger problems, it achieves 77.8% perfect mappings (mean guess level=13.1%).FAME also outperforms humans in solving analogy mapping problems (90% vs. 70.2%).Interestingly, our automatic suggestions of new entities resemble those suggested by humans.
In future work, we plan to improve coverage and extend our framework to more than just binary relations, as sometimes the key to an analogy is a relation involving more than two objects.In addition, we plan to improve our similarity measure, to address both context (to solve ambiguity) and the difference between active and passive relations.We plan to explore different forms of input, such as algorithms that take as input very partial domains, perhaps even just domain names (e.g., solar system, atom) and populate the domains with entities, or algorithms incorporating user feedback.
To conclude, we hope FAME will pave the way for analogy-making algorithms that require less restrictive inputs and can scale up and tap into the vast amount of potential inspiration the web offers, augmenting human creativity.

Ethical Considerations & Limitations
While FAME can assist humans by inspiring nontrivial solutions to problems, it has been shown that humans struggle with detecting caveats in presented analogies (Holyoak et al., 1995). For example, the cardiovascular system is often taught to medical students in terms of a water supply system (Swain, 2000). However, this analogy might also confuse them, as it ignores important differences between water and blood (e.g., blood clots). Thus, while our output is interpretable, it might still mislead people, and it is important to alert the users to this possibility.
Another issue is the fact that FAME's coverage highly depends on external resources (ConceptNet, Google AutoComplete, etc.).This might be particularly problematic when applied to low-resource languages.As the relations we look for are commonsense relations, rather than cultural or situational ones, using automatic translation might ameliorate the problem.
Lastly, we also note these resources evolve over time, and thus if one is interested in reproducibility, it is necessary to save snapshots of the extracted relations.
anonymity concerns (username). We will include it in the non-anonymized version.
We provide a React-based web interface, currently available only locally. This system is used to visualize the graphs created by the algorithm's mapping output. In addition, it visualizes the relations between entities, their similarity, and the clustering. This interface is useful for assisting in developing, debugging and understanding the algorithm's output. The demo is accessible through our repository.

A.4 Experiments
Snippets of the experimental setup (including instructions) can be found in Figures 7 and 8.
Table 4 depicts the ten analogical proportion problems used in the structure mapping experiment (closed-world mapping in Section 5.3). Accuracy denotes the percentage of human participants who mapped from B to T correctly. Results show this task is non-trivial even for humans.
Table 6 illustrates the experimental setup for the second phase of our experiment, in which participants received a solved mapping problem with one entity left out (open-world in Section 5.3).
Table 5 contains all solved analogy problems used in the second phase of the experiment (entity suggestion, see open-world in Section 5.3). Participants were given the complete mapping, but with a missing entity (as presented here).

A.5 E-KAR
Table 7 shows an example of a problematic problem from the E-KAR dataset.

Figure 3: Left: partial relations of Earth:sun.Right: partial relations of electron:nucleus.This is the result of the maximum weighted match on the clusters.Colors correspond to clusters.

Figure 5: Word cloud of human completions for B1 (Table A.6). While most responses were from the same semantic domain, some were creative and appropriate (e.g., treasure chest, jewelry box, car).

Figure 7: Closed-World Mapping: Experiment instructions with the first question.

Figure 8: Open-World Entity Suggestion: Experiment instructions with the first question.