BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?

Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as "eye is to seeing what ear is to hearing", sometimes referred to as analogical proportions, shapes how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.


Introduction
One of the most widely discussed properties of word embeddings has been their surprising ability to model certain types of relational similarities in terms of word vector differences (Mikolov et al., 2013b; Vylomova et al., 2016; Allen and Hospedales, 2019; Ethayarajh et al., 2019). The underlying assumption is that when "a is to b what c is to d", the word vector differences b − a and d − c are expected to be similar, where we write x for the embedding of a word x. While this assumption holds for some types of syntactic relations, for semantic relations it holds to a much more limited degree than was suggested in early work (Linzen, 2016; Schluter, 2018). Moreover, the most commonly used benchmarks have focused on specific and well-defined semantic relations such as "capital of", rather than the more abstract notion of relational similarity that is often needed for solving the kind of psychometric analogy problems that can be found in IQ tests and educational settings.

While the title is probably self-explanatory, a small note: "BERT is to NLP what AlexNet is to CV" is itself an analogy about what the BERT and AlexNet models represented for Natural Language Processing (NLP) and Computer Vision (CV), respectively. Both brought a paradigm shift in how research was undertaken in their corresponding disciplines, and this is what the analogy refers to.

1 Source code and data to reproduce our experimental results are available in the following repository: https://github.com/asahi417/analogy-language-model

Table 1: An example analogy problem.
Query: word:language
Candidates: (1) paint:portrait (2) poetry:rhythm (3) note:music (4) tale:story (5) week:year
An example of such a problem is shown in Table 1. Given the central role of analogy in human cognition, it is nonetheless important to understand the extent to which NLP models are able to solve these more abstract analogy problems. Besides its value as an intrinsic benchmark for lexical semantics, the ability to recognize analogies is indeed important in the contexts of human creativity (Holyoak et al., 1996), innovation (Hope et al., 2017), computational creativity (Goel, 2019) and education (Pardos and Nam, 2020). Analogies are also a prerequisite to build AI systems for the legal domain (Ashley, 1988;Walton, 2010) and are used in machine learning (Miclet et al., 2008;Hug et al., 2016;Hüllermeier, 2020) and for ontology alignment (Raad and Evermann, 2015), among others.
Within NLP, however, the task of recognizing analogies has received relatively little attention. To solve such problems, Turney (2005) proposed Latent Relational Analysis (LRA), which was essentially designed as a relational counterpart to Latent Semantic Analysis (Landauer and Dumais, 1997). Somewhat surprisingly, perhaps, despite the substantial progress that word embeddings and language models (LMs) have enabled in NLP, LRA still represents the current state-of-the-art in solving abstract word analogy problems. When going beyond a purely unsupervised setting, however, GPT-3 was recently found to obtain slightly better results (Brown et al., 2020).
The aim of this paper is to analyze the ability of pre-trained LMs to recognize analogies. Our focus is on the zero-shot setting, where LMs are used without fine-tuning. To predict whether two word pairs (a, b) and (c, d) are likely to be analogical, we need a prompt, i.e. a template that is used to construct the input to the LM, and a scoring function. We extensively analyze the impact of both of these choices, as well as the differences between different LMs. When the prompt and scoring function are carefully calibrated, we find that GPT-2 can outperform LRA and standard word embeddings, as well as the published results for GPT-3 in the zero-shot setting. However, we also find that these results are highly sensitive to the choice of the prompt, as well as two hyperparameters in our scoring function, with the optimal choices not being consistent across different datasets. Moreover, using BERT leads to considerably weaker results, underperforming even standard word embeddings in all of the considered configurations. These findings suggest that while transformer-based LMs learn relational knowledge to a meaningful extent, more work is needed to understand how such knowledge is encoded, and how it can be exploited.

Related Work

Since their recent dominance in standard NLP benchmarks (Peters et al., 2018a; Devlin et al., 2019; Liu et al., 2019), pre-trained language models have been extensively studied. This has mainly been done through probing tasks, which are aimed at understanding the knowledge that is implicitly captured by their parameters. After the initial focus on understanding pre-trained LSTM-based LMs (Peters et al., 2018b), attention has now shifted toward transformer-based models. The main aspects that have been studied in recent years are syntax (Goldberg, 2019; Saphra and Lopez, 2019; Hewitt and Manning, 2019; van Schijndel et al., 2019; Jawahar et al., 2019; Tenney et al., 2019b) and semantics (Ettinger, 2019; Tenney et al., 2019a).
For a more complete overview on analyses of the different properties of transformer-based LMs, we refer to Rogers et al. (2021).
Despite the rise in probing analyses for LMs and the importance of analogical reasoning in human cognition, understanding the analogical capabilities of LMs remains understudied. The most similar works have focused on capturing relational knowledge from LMs (in particular the type of information available in knowledge graphs). For instance, Petroni et al. (2019) analyzed to what extent LMs could fill manually-defined templates such as "Dante was born in [MASK]". Follow-up works extended this initial approach by automatically generating templates and fine-tuning LMs on them (Bouraoui et al., 2020;Jiang et al., 2020), showing an improved performance. In this paper, we focus on the analogical knowledge that is encoded in pre-trained LMs, without the extra step of fine-tuning on additional data.

Word Analogy Probing
Word analogies have been used as a standard intrinsic evaluation task for measuring the quality of word embeddings. Mikolov et al. (2013b) showed that word embeddings, in particular Word2vec embeddings, were able to solve analogy problems by simple vector operations (e.g. king - man + woman = queen). The motivation for this task dates back to the connectionism theory (Feldman and Ballard, 1982) in cognitive science. In particular, neural networks were thought to be able to model emergent concepts (Hopfield, 1982) by learning distributed representations across an embedding space, similar to the properties that word embeddings displayed in the analogy task. More recent works have proposed new mathematical theories and experiments to understand the analogical capabilities of word embeddings, attempting to understand their linear algebraic structure (Arora et al., 2016; Gittens et al., 2017; Allen and Hospedales, 2019) or by explicitly studying their compositional nature (Levy and Goldberg, 2014; Paperno and Baroni, 2016; Ethayarajh et al., 2019; Chiang et al., 2020).
However, recent works have questioned the impressive results displayed by word embeddings in this task. In many cases simple baselines excluding the input pair (or query) were competitive (Linzen, 2016). Simultaneously, some researchers have found that many relationships may not be retrieved in the embedding space by simple linear transformations (Drozd et al., 2016;Bouraoui et al., 2018) and others argued that the standard evaluation procedure has limitations (Schluter, 2018). New datasets and measures have also been introduced to address some of these issues (Gladkova et al., 2016;Fournier et al., 2020). Finally, in the context of bias detection, for which analogies have been used as a proxy (Bolukbasi et al., 2016), it has also been found that word analogies may misguide or hide the real relationships existing in the vector space (Gonen and Goldberg, 2019;Nissim et al., 2020).
As far as language models are concerned, word analogies have not been explored to the same extent as for word embeddings. Recently, Brown et al. (2020) evaluated the unsupervised capabilities of GPT-3 by evaluating it on the SAT analogies dataset (Turney et al., 2003), which we also include in our evaluation (see Section 3.2). However, the evaluation is limited to a single dataset (i.e., SAT) and model (i.e., GPT-3), and the general capabilities of language models were not investigated.
Despite their limitations, analogy tests remain appealing for evaluating the ability of embeddings and language models to identify abstract relationships. To mitigate the aforementioned methodological issues, in this work we rely on analogy tests from educational resources, where the task is to complete analogical proportions, given only the first word pair. In contrast, word embedding models have mostly been evaluated using a predictive task, in which three of the four words are given. Moreover, the considered datasets are focused on abstract analogies, whereas the most commonly used datasets only include well-defined semantic relations such as "capital of". For completeness, however, we also show results on these standard datasets. We furthermore experiment with several simple baselines to understand possible artifacts present in the different datasets.

Word Analogies
In this section, we describe the word analogy formulation that is used for our experiments (Section 3.1). Subsequently, we provide an overview of the datasets used in our experiments (Section 3.2).

Task Description
We frame the analogy task in terms of analogical proportions (Prade and Richard, 2017). Given a query word pair (h_q, t_q) and a list of candidate answer pairs {(h_i, t_i)}_{i=1}^{n}, the goal is to find the candidate answer pair that has the most similar relation to the query pair. Table 1 shows a sample query and candidate answers drawn from one of the datasets used in our evaluation (see Section 3.2).
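The task formulation above can be stated compactly in code. The following is a minimal sketch (the class and function names are our own, not from the released codebase): a problem is a query pair plus candidate pairs, and any relational-similarity scoring function induces a solver by taking the arg-max over candidates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

WordPair = Tuple[str, str]

@dataclass
class AnalogyProblem:
    """One analogical-proportion question: find the candidate pair whose
    relation is most similar to that of the query pair."""
    query: WordPair
    candidates: List[WordPair]
    answer: int  # index of the gold candidate pair

def solve(problem: AnalogyProblem,
          score: Callable[[WordPair, WordPair], float]) -> int:
    """Return the index of the highest-scoring candidate under `score`,
    which rates the relational similarity of two word pairs."""
    scores = [score(problem.query, c) for c in problem.candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# The example from Table 1:
problem = AnalogyProblem(
    query=("word", "language"),
    candidates=[("paint", "portrait"), ("poetry", "rhythm"),
                ("note", "music"), ("tale", "story"), ("week", "year")],
    answer=2,  # note:music
)
```

All scoring functions discussed later (perplexity, PMI, mPPL, and the AP score) fit the `score` signature, so the solver itself never changes.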

Analogy Datasets
We split analogy datasets into two types, based on how the analogy problems were constructed.

Psychometric Analogy Tests
Word analogy tests are commonly used in assessments of linguistic and cognitive ability. For instance, in the past, such tests were included in the SAT exams, a US college admission test. Turney et al. (2003) collected a benchmark of 374 word analogy problems, consisting primarily of problems from these SAT tests. Aimed at college applicants, these problems are designed to be challenging for humans. A key challenge for NLP systems is that solving these problems often requires identifying fine-grained semantic differences between word pairs that belong to the same coarse-grained relation. For instance, in the case of Table 1, we could say that "a year consists of weeks" like "language consists of words", but the week-year pair is nonetheless less similar to word-language than note-music.
Another analogy benchmark was constructed by Boteanu and Chernova (2015), who used word analogy problems from an educational resource 2 . They used in particular UNIT 2 of the analogy problems from the educational site. These problems have the same form as those from the SAT benchmark, but rather than college applicants, they are aimed at children in grades 4 to 12 of the US school system (i.e. from age 9 onwards). In this paper, we will also include this UNIT 2 benchmark. Moreover, we have collected another benchmark from the UNIT 4 problems on the same website. These UNIT 4 problems are organised into five difficulty levels: high-beginning, low-intermediate, high-intermediate, low-advanced and high-advanced. The low-advanced level is stated to be at the level of the SAT tests, whereas the high-advanced level is stated to be at the level of the GRE test (which is used for admission into graduate schools).

Lexical Semantics Benchmarks
Since the introduction of Word2vec (Mikolov et al., 2013a), the problem of modelling analogies has been commonly used as an intrinsic benchmark for word embedding models. However, the datasets that have been used in that context are focused on well-defined and relatively coarse-grained relations. The Google analogy dataset (Mikolov et al., 2013b) has been one of the most commonly used benchmarks for intrinsic evaluation of word embeddings. This dataset contains a mix of semantic and morphological relations such as capital-of and singular-plural, respectively. However, its coverage has been shown to be limited, and BATS (Gladkova et al., 2016) was developed in an attempt to address its main shortcomings. BATS includes a larger number of concepts and relations, which are split into four categories: lexicographic, encyclopedic, and derivational and inflectional morphology. As pointed out above, these datasets were tailored to the evaluation of word embeddings in a predictive setting. To provide an evaluation setting which is comparable to the benchmarks obtained from human analogy tests, we constructed word analogy problems from the Google and BATS datasets, by choosing for each correct analogy pair a number of negative examples. The resulting benchmark thus follows the same format as described in Section 3.1. To obtain sufficiently challenging negative examples, for each query pair (e.g. Paris-France) we extracted three negative instances: (1) two random words from the head of the input relation type (e.g. Rome-Oslo); (2) two random words from the tail of the input relation type (e.g. Germany-Canada); (3) a random word pair from a relation type of the same high-level category as the input relation type (e.g. Argentina-peso). 3

Figure 1: Solving a word analogy problem by selecting one with the highest LM score among the candidates.

Table 2 provides an overview of our datasets. The instances from each dataset are organised into groups.
In the case of Google and BATS, these groups refer to the relation types (e.g. semantic or morphological in the case of Google). In the case of UNIT 2 and UNIT 4, the groups refer to the difficulty level. For the SAT dataset, we consider two groups, capturing whether the instances come from an actual SAT test or not. Finally, we randomly sample 10% of each group in each dataset to construct a validation set, and regard the remaining data as the test set.
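The three negative-sampling strategies described above can be sketched as follows. The helper `make_negatives` and the toy relations are illustrative only; the exact sampling used to build the released Google and BATS analogy problems may differ in detail.

```python
import random

def make_negatives(query, relation, same_category_relations, rng):
    """Build three hard negatives for one query pair (e.g. Paris-France):
      (1) two random heads of the same relation type (e.g. Rome-Oslo),
      (2) two random tails of the same relation type (e.g. Germany-Canada),
      (3) a random pair from another relation type of the same high-level
          category (e.g. Argentina-peso)."""
    heads = [h for h, t in relation if (h, t) != query]
    tails = [t for h, t in relation if (h, t) != query]
    neg_heads = tuple(rng.sample(heads, 2))
    neg_tails = tuple(rng.sample(tails, 2))
    other = rng.choice([r for r in same_category_relations if r is not relation])
    neg_other = rng.choice(other)
    return [neg_heads, neg_tails, neg_other]

rng = random.Random(0)
capital_of = [("Paris", "France"), ("Rome", "Italy"), ("Oslo", "Norway"),
              ("Berlin", "Germany"), ("Ottawa", "Canada")]
currency_of = [("Argentina", "peso"), ("Japan", "yen")]
negatives = make_negatives(("Paris", "France"), capital_of,
                           [capital_of, currency_of], rng)
```

Each correct pair thus comes with three distractors that share either a head, a tail, or a high-level category with the query relation, which makes the recast datasets harder than random sampling would.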

Methodology
In this section, we explain our strategy for using pretrained LMs to solve analogy problems without fine-tuning. First, in Section 4.1 we explain how each relation pair is converted into a natural sentence to be fed into the LM. In Section 4.2, we then discuss a number of scoring functions that can be used to select the most plausible answer candidate. Finally, we take advantage of the fact that analogical proportion is invariant to particular permutations, which allows for a natural extension of the proposed scoring functions (Section 4.3). Figure 1 shows a high-level overview of our methodology.

Relation Pair Prompting
We define a prompting function T_t(w_1, w_2, w_3, w_4) that takes four placeholders and a template type t, and returns a sentence in which the placeholders are replaced by the words w_1, w_2, w_3, and w_4. For instance, given a query "word:language" and a candidate "note:music", the prompting function produces T_to-as("word", "language", "note", "music") = "word is to language as note is to music", where we use the template type to-as here.
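Such a prompting function is straightforward to implement as a lookup over template strings. In the sketch below, only the to-as wording is taken from the text; the second template entry is an illustrative assumption, not necessarily one of the six prompts evaluated in the appendix.

```python
TEMPLATES = {
    # Template types: the to-as wording matches the example in the text;
    # the "what" entry is illustrative.
    "to-as": "{} is to {} as {} is to {}",
    "what": "{} is to {} what {} is to {}",
}

def prompt(t: str, w1: str, w2: str, w3: str, w4: str) -> str:
    """T_t(w1, w2, w3, w4): fill the four placeholders of template type t."""
    return TEMPLATES[t].format(w1, w2, w3, w4)
```

For example, `prompt("to-as", "word", "language", "note", "music")` yields the sentence "word is to language as note is to music", which is then fed to the LM for scoring.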
Using manually specified template types can result in a sub-optimal textual representation. For this reason, recent studies have proposed auto-prompting strategies, which optimize the template on a training set (Shin et al., 2020), rely on paraphrasing (Jiang et al., 2020), train an additional prompt generation model (Gao et al., 2020), or mine templates from a corpus (Bouraoui et al., 2020). However, none of these approaches can be applied in unsupervised settings. Thus, we do not explore auto-prompting methods in this work. Instead, we consider a number of different template types in the experiments, and assess the sensitivity of the results to the choice of template type.

Scoring Function
Perplexity. We first define perplexity, which is widely used as a sentence re-ranking metric (Chan et al., 2016; Gulcehre et al., 2015). Given a sentence x consisting of tokens x_1, ..., x_m, for autoregressive LMs such as LSTM-based models (Zaremba et al., 2014) and GPTs (Radford et al., 2018, 2019; Brown et al., 2020), perplexity can be computed as

PPL(x) = exp( -(1/m) Σ_{j=1}^{m} log P(x_j | x_1, ..., x_{j-1}) )    (1)

where P(x_j | x_1, ..., x_{j-1}) is the likelihood from an autoregressive LM's next-token prediction. For masked LMs such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we instead use pseudo-perplexity (Wang and Cho, 2019), which replaces the conditional likelihood in (1) by P(x_j | x_{\j}), the probability from the masked token prediction that the masked token is x_j. PMI. Although perplexity is well-suited to capture the fluency of a sentence, it may not be the best choice to test the plausibility of a given analogical proportion candidate. As an alternative, we propose a scoring function that focuses specifically on words from the two given pairs. To this end, we propose to use an approximation of point-wise mutual information (PMI), based on perplexity.
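Assuming per-token conditional probabilities are already available from a model, both quantities reduce to a few lines of code. This is a generic sketch of the definitions, not the paper's implementation; the only difference between the two cases is where the probabilities come from (next-token prediction vs. masking each position in turn).

```python
import math

def perplexity(token_probs):
    """PPL of a sentence from per-token conditional probabilities
    P(x_j | x_1, ..., x_{j-1}) given by an autoregressive LM."""
    m = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / m)

def pseudo_perplexity(masked_probs):
    """Pseudo-perplexity for a masked LM (Wang and Cho, 2019): the j-th
    entry is P(x_j | x_{\\j}), the probability the model assigns to the
    original token when position j is masked out."""
    m = len(masked_probs)
    return math.exp(-sum(math.log(p) for p in masked_probs) / m)
```

Lower values indicate more fluent sentences; a model that assigns uniform probability 1/V to every token yields PPL = V.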
PMI is defined as the difference between a conditional and marginal log-likelihood. In our case, we consider the conditional likelihood of t_i given h_i and the query pair (recall from Section 3.1 that h and t represent the head and tail of a given word pair, respectively), i.e. P(t_i | h_q, t_q, h_i), and the marginal likelihood over h_i, i.e. P(t_i | h_q, t_q). Subsequently, the PMI-inspired scoring function is defined as

r(t_i | h_i, h_q, t_q) = log P(t_i | h_q, t_q, h_i) − α · log P(t_i | h_q, t_q)    (2)

where α is a hyperparameter to control the effect of the marginal likelihood. The PMI score corresponds to the specific case where α = 1. However, Davison et al. (2019) found that using a hyperparameter to balance the impact of the conditional and marginal probabilities can significantly improve the results. The probabilities in (2) are estimated by assuming that the answer candidates are the only possible word pairs that need to be considered. By relying on this closed-world assumption, we can estimate marginal probabilities based on perplexity, which we found to give better results than the masking based strategy from Davison et al. (2019).
In particular, we estimate these probabilities as

P(t_i | h_q, t_q, h_i) = e^(−PPL(x_i)) / Σ_{k=1}^{n} e^(−PPL(x_k)),  with x_k = T_t(h_q, t_q, h_k, t_k),

P(t_i | h_q, t_q) = (1/n) Σ_{k=1}^{n} P(t_i | h_q, t_q, h_k),

where n is the number of answer candidates for the given query. Equivalently, since PMI is symmetric, we can consider the difference between the logs of P(h_i | h_q, t_q, t_i) and P(h_i | h_q, t_q). While this leads to the same PMI value in theory, due to the way in which we approximate the probabilities, this symmetric approach will lead to a different score. We thus combine both scores with an aggregation function A_g. This aggregation function takes a list of scores and outputs an aggregated value. As an example, given a list [1, 2, 3, 4], we write A_mean([1, 2, 3, 4]) = 2.5 for the mean and A_val1([1, 2, 3, 4]) = 1 for the first element. Given such an aggregation function, we define the following PMI-based score

s_PMI(t_i, h_i | h_q, t_q) = A_g(r),

where we consider basic aggregation operations over the list r = [r(t_i | h_i, h_q, t_q), r(h_i | t_i, h_q, t_q)], such as the mean, max, and min value. The choice of using only one of the scores r(t_i | h_i, h_q, t_q), r(h_i | t_i, h_q, t_q) is viewed as a special case, in which the aggregation function g simply returns the first or the second item.

mPPL. We also experiment with a third scoring function, which borrows ideas from both perplexity and PMI. In particular, we propose the marginal likelihood biased perplexity (mPPL), defined as

s_mPPL(t_i, h_i | h_q, t_q) = log s_PPL(t_i, h_i | h_q, t_q) − α_t · log P(t_i | h_q, t_q) − α_h · log P(h_i | h_q, t_q),

where α_t and α_h are hyperparameters, and s_PPL is a normalized perplexity defined as

s_PPL(t_i, h_i | h_q, t_q) = e^(−PPL(x_i)) / Σ_{k=1}^{n} e^(−PPL(x_k)).
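Under the closed-world assumption, everything needed for mPPL-style scoring can be derived from the perplexities of the prompted candidate sentences. The sketch below is our own reading of the definitions, with hypothetical helper names (`softmin`, `mppl_scores`); the paper's implementation may normalize differently.

```python
import math

def softmin(ppls):
    """Closed-world probability estimate over the n candidates: lower
    perplexity gets higher probability, normalized to sum to 1."""
    weights = [math.exp(-p) for p in ppls]
    z = sum(weights)
    return [w / z for w in weights]

def mppl_scores(cand_ppls, tail_marginals, head_marginals, alpha_t, alpha_h):
    """Marginal-likelihood biased perplexity for each candidate i:
        log P(x_i) - alpha_t * log P(t_i | query) - alpha_h * log P(h_i | query)
    where P(x_i) is the closed-world probability of the prompted sentence.
    With alpha_t = alpha_h = 0 this ranks candidates by perplexity alone."""
    probs = softmin(cand_ppls)
    return [math.log(p) - alpha_t * math.log(pt) - alpha_h * math.log(ph)
            for p, pt, ph in zip(probs, tail_marginals, head_marginals)]
```

Setting the two α coefficients to nonzero values biases the ranking towards (or, for negative values, away from) candidates whose head and tail words are individually likely given the query.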
The mPPL score extends perplexity with two bias terms. It is motivated from the insight that treating α as a hyperparameter in (2) can lead to better results than fixing α = 1. By tuning α t and α h , we can essentially influence to what extent answer candidates involving semantically similar words to the query pair should be favored.

Permutation Invariance
The axioms of analogical proportions imply that they are invariant under certain permutations of the four words involved (Prade and Richard, 2017): for instance, a : b :: c : d is equivalent to c : d :: a : b, whereas other orderings, such as b : a :: c : d, do not constitute valid analogical proportions. The positive and negative permutations we consider are shown in Figure 2. To take advantage of the different permutations of analogical proportions, we propose the following Analogical Proportion (AP) score:

AP(t_i, h_i, h_q, t_q) = A_gpos(s_P) − β · A_gneg(s_N),

where s_P and s_N are the lists of scores s(a, b | c, d) for the positive and negative permutations P and N of the candidate analogical proportion h_q : t_q :: h_i : t_i, in the order shown in Figure 2, β is a hyperparameter to control the impact of the negative permutations, and s(a, b | c, d) is a scoring function as described in Section 4.2. Here A_gpos and A_gneg refer to the aggregation functions that are used to combine the scores for the positive and negative permutations respectively, where these aggregation functions are defined as in Section 4.2. To solve an analogy problem, we simply choose the answer candidate that results in the highest value of AP(t_i, h_i, h_q, t_q).
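The AP score can be sketched as follows. The permutation lists below are illustrative: the positives are the eight orderings under which an analogical proportion classically remains valid, and the negatives are a few invalid shuffles; the exact sets used in the experiments are those of Figure 2.

```python
def ap_score(score, a, b, c, d, beta, agg_pos=max, agg_neg=max):
    """Analogical Proportion score.  `score(w, x, y, z)` rates the
    proportion w : x :: y : z (e.g. one of the scoring functions from
    Section 4.2).  Positive permutations are orderings equivalent to
    a : b :: c : d; negative ones are orderings under which the
    proportion need not hold.  Both lists are illustrative here."""
    positives = [(a, b, c, d), (c, d, a, b), (b, a, d, c), (d, c, b, a),
                 (a, c, b, d), (c, a, d, b), (b, d, a, c), (d, b, c, a)]
    negatives = [(a, c, d, b), (b, c, a, d), (d, a, c, b)]
    pos = agg_pos([score(*p) for p in positives])
    neg = agg_neg([score(*n) for n in negatives])
    return pos - beta * neg
```

With β = 0 and an aggregation that returns the first element, this reduces to scoring the original ordering only, which is the default configuration used in Table 3.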

Evaluation
In this section, we evaluate language models on the five analogy datasets presented in Section 3.

Experimental Setting
We consider three transformer-based LMs of a different nature: two masked LMs, namely BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and GPT-2, as a prominent example of an autoregressive language model. Each pretrained model was fetched from the Huggingface transformers library (Wolf et al., 2019), from which we use bert-large-cased, roberta-large, and gpt2-xl respectively. For parameter selection, we run grid search on β, α, α_h, α_t, t, g, g_pos, and g_neg for each model and select the configuration which achieves the best accuracy on each validation set. We experiment with the three scoring functions presented in Section 4.2, i.e., s_PPL (perplexity), s_PMI and s_mPPL. Possible values for each hyperparameter (including the selection of six prompts and an ablation test on the scoring function) and the best configurations that were found by grid search are provided in the appendix.

Table 3: Accuracy results on each analogy dataset, categorized into language models (LM), word embeddings (WE), and baselines (Base). All LMs use the analogical proportion (AP) function described in Section 4.3. The default configuration for AP is α = α_h = α_t = β = 0, g_pos = g = val_1, and t = to-as. Note that s_PPL = s_mPPL with the default configuration. Average accuracy (Avg) across datasets is included in the last column.

As baseline methods, we also consider three pre-trained word embedding models, which have been shown to provide competitive results in analogy tasks, as explained in Section 2.2: Word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). For the word embedding models, we simply represent word pairs by taking the difference between their embeddings 4 . We then choose the answer candidate with the highest cosine similarity to the query in terms of this vector difference. To put the results into context, we also include two simple statistical baselines. First, we report the expected random performance. Second, we use a method based on each word pair's PMI in a given corpus. We then select the answer candidate with the highest PMI as the prediction. Note that the query word pair is completely ignored in this case. This PMI score is the well-known word-pair association metric introduced by Church and Hanks (1990) for lexicographic purposes (specifically, collocation extraction), which compares the probability of observing two words together with the probabilities of observing them independently (chance). The PMI scores in our experiments were computed using the English Wikipedia with a fixed window size of 10.
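The word embedding baseline amounts to a nearest-neighbour search over difference vectors. A minimal sketch, assuming `emb` maps words to pre-trained vectors (any Word2vec, GloVe or FastText lookup would serve; the toy vectors in the test are ours):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def embedding_baseline(emb, query, candidates):
    """Represent each word pair by the difference of its word vectors and
    return the index of the candidate whose difference vector is most
    cosine-similar to the query's."""
    def diff(pair):
        h, t = pair
        return [x - y for x, y in zip(emb[t], emb[h])]
    q = diff(query)
    sims = [cosine(q, diff(c)) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)
```

Because only the two difference vectors are compared, this baseline needs no template, scoring function, or hyperparameters, which is what makes its strong performance in Table 3 noteworthy.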

Results
Table 3 shows our main results. As far as the comparison among LMs is concerned, RoBERTa and GPT-2 consistently outperform BERT. Among the AP variants, s_mPPL achieves substantially better results than s_PMI or s_PPL in most cases. We also observe that word embeddings perform surprisingly well, with FastText and GloVe outperforming BERT on most datasets, as well as GPT-2 and RoBERTa with default hyperparameters. FastText achieves the best overall accuracy on the Google dataset, confirming that this dataset is particularly well-suited to word embeddings (see Section 2.2). In order to compare with published results from prior work, we carried out an additional experiment on the full SAT dataset (i.e., without splitting it into validation and test). Table 4 shows the results. GPT-3 (Brown et al., 2020) and LRA (Turney, 2005) are added for comparison. Given the variability of the results depending on the tuning procedure, we have also reported results of configurations that were tuned on the entire set, to provide an upper bound on what is possible within the proposed unsupervised setting. This result shows that even with optimal hyperparameter values, LMs barely outperform the simpler LRA model. GPT-3 similarly fails to outperform LRA in the zero-shot setting.

Analysis
We now take a closer look into our results to investigate parameter sensitivity, the correlation between model performance and human difficulty levels, and possible dataset artifacts. The following analysis focuses on s_mPPL as it achieved the best results among the LM based scoring functions. Parameter Sensitivity We found that optimal values of the parameters α and β are highly dependent on the dataset, while other parameters such as the template type t vary across LMs. On the other hand, as shown in Figure 3, the optimal permutations of the templates are relatively consistent, with the original ordering a : b :: c : d typically achieving the best results. The results degrade most for permutations that mix the two word pairs (e.g. a : c :: b : d). In the appendix we include an ablation study for the sensitivity and relevance of other parameters and design choices.
Difficulty Levels To increase our understanding of what makes an analogy problem difficult for LMs, we compare the results for each difficulty level. 5 Recall from Section 3.2 that the U2 and U4 datasets come from educational resources and are split by difficulty level. Figure 4 shows the results of all LMs (tuned setting), FastText and the PMI baseline according to these difficulty levels. Broadly speaking, we can see that instances that are harder for humans are also harder for the considered models. The analogies in the most difficult levels are generally more abstract (e.g. witness : testimony :: generator : electricity), or contain obscure or infrequent words (e.g. grouch : cantankerous :: palace : ornate). 6

Figure 4: Test accuracy in U2 and U4 per difficulty level. LMs use s_mPPL with the best configuration tuned on the corresponding validation sets.
Hypothesis Only Recently, several researchers have found that standard NLP benchmarks, such as SNLI (Bowman et al., 2015) for language inference, contain annotation artifacts that make the task simpler for automatic models (Poliak et al., 2018; Gururangan et al., 2018). One of their most relevant findings is that models which do not even consider the premise can reach high accuracy. More generally, these issues have been found to be problematic in NLP models (Linzen, 2020) and neural networks in general (Geirhos et al., 2020). According to the results shown in Table 3, we already found that the PMI baseline achieved a non-trivial performance, even outperforming BERT in a few settings and datasets. This suggests that several implausible negative examples are included in the analogy datasets. As a further exploration of such artifacts, here we analyse the analogue of a hypothesis-only baseline. In particular, for this analysis, we masked the head or tail of the candidate answer in all evaluation instances. Then, we test the masked language models with the same AP configuration and tuning on these artificially-modified datasets. As can be seen in Table 5, a non-trivial performance is achieved for all datasets, which suggests that the words from the answer pair tend to be more similar to the words from the query than the words from negative examples.

Conclusion
In this paper, we have presented an extensive analysis of the ability of language models to identify analogies. To this end, we first compiled datasets with psychometric analogy problems from educational resources, covering a wide range of difficulty levels and topics. We also recast two standard benchmarks, the Google and BATS analogy datasets, into the same style of problems. Then, we proposed standard techniques to apply language models to the unsupervised task of solving these analogy problems. Our empirical results shed light on the strengths and limitations of various models.
To directly answer the question posed in the title, our conclusion is that language models can identify analogies to a certain extent, but not all language models are able to achieve a meaningful improvement over word embeddings (whose limitations in analogy tasks are well documented). On the other hand, when carefully tuned, some language models are able to achieve state-of-the-art results. We emphasize that results are highly sensitive to the chosen hyperparameters (which define, among others, the scoring function and the prompt). Further research could focus on the selection of these optimal hyperparameters, including automating the mining or generation of prompts, along the lines of Bouraoui et al. (2020) and Shin et al. (2020), respectively. Finally, LMs might still be able to learn to solve analogy tasks when given appropriate training data, an aspect that we leave for future work.

A Experimental Details
In our grid search to find the optimal configuration for each dataset and language model, each parameter was selected from the values shown in Table 6. For the marginal likelihood coefficients α, α_h, and α_t, we also considered negative values, as we hypothesized that the marginal likelihood could be beneficial for LMs as a way to leverage lexical knowledge of the head and tail words. Additionally, Table 7 shows the set of custom templates (or prompts) used in our experiments. Finally, Tables 8, 9, and 10 include the best configuration based on each validation set for s_PMI, s_mPPL, and the hypothesis-only baseline, respectively.

B Additional Ablation Results
We show a few more complementary results to our main experiments.

B.1 Alternative Scoring Functions
As alternative scoring functions for LMs, we tried two other scores: a PMI score based on masked token prediction (Davison et al., 2019) (Mask PMI), and the cosine similarity between the embedding differences of a relation pair, similar to what is used in word embedding models. For the embedding method, we feed a prompted sentence to the LM to obtain the last layer's hidden state for each word in the given pair, and take the difference between them, which we regard as the embedding vector for the pair. Finally, we pick the candidate that is most similar to the query embedding in terms of cosine similarity. Table 11 shows the results.

B.2 Parameter Sensitivity: template type t
Figure 5 shows the box plot of relative improvement across all datasets, grouped by t. The results indicate a mild trend that certain templates tend to perform well, but no significant universal selectivity can be found across datasets.
B.3 Parameter Sensitivity: aggregation method g_neg
Figure 6 shows the box plot of relative improvement across all datasets, grouped by g_neg. Unlike g_pos, shown in Figure 3, these do not give strong signals across datasets.

Figure 7 shows the results of different language models with the s_mPPL scoring function on the different categories of the BATS and Google datasets.

C Error Analysis

Table 12 shows all examples from the U2 dataset of the easiest difficulty (i.e. grade 4) which were misclassified by RoBERTa, with s_mPPL tuned on the validation set. We can see a few typical issues with word embeddings and language models. For instance, in the first example, the model confuses the antonym pair right:wrong with synonymy. In the second example, we have that someone who is poor lacks money, while someone who is hungry lacks food. However, the selected candidate pair is hungry:water rather than hungry:food, which is presumably chosen because water is assumed to be a near-synonym of food. In the third example (wrench:tool), the hypernymy relation is confused with a meronymy relation in the selected candidate tree:forest. In the last three examples, the model has selected answers which seem reasonable. In the fourth example, beautiful:pretty, terrible:bad and brave:valiant can all be considered to be synonym pairs. In the fifth example, vehicle:transport is clearly the correct answer, but the pair song:sing is nonetheless relationally similar to shield:protect. In the last example, we can think of being sad as an emotional state, like being sick is a health state, which provides some justification for the predicted answer. On the other hand, the gold answer is based on the argument that someone who is sick lacks health, like someone who is scared lacks courage.