Distilling Relation Embeddings from Pre-trained Language Models

Pre-trained language models have been found to capture a surprisingly rich amount of lexical knowledge, ranging from commonsense properties of everyday concepts to detailed factual knowledge about named entities. Among others, this makes it possible to distill high-quality word vectors from pre-trained language models. However, it is currently unclear to what extent it is possible to distill relation embeddings, i.e. vectors that characterize the relationship between two words. Such relation embeddings are appealing because they can, in principle, encode relational knowledge in a more fine-grained way than is possible with knowledge graphs. To obtain relation embeddings from a pre-trained language model, we encode word pairs using a (manually or automatically generated) prompt, and we fine-tune the language model such that relationally similar word pairs yield similar output vectors. We find that the resulting relation embeddings are highly competitive on analogy (unsupervised) and relation classification (supervised) benchmarks, even without any task-specific fine-tuning. Source code to reproduce our experimental results and the model checkpoints are available in the following repository: https://github.com/asahi417/relbert


Introduction
One of the most widely studied aspects of word embeddings is the fact that word vector differences capture lexical relations (Mikolov et al., 2013a). While not directly connected to downstream performance on NLP tasks, this ability of word embeddings is nonetheless important. For instance, understanding lexical relations is an important prerequisite for understanding the meaning of compound nouns (Turney, 2012). Moreover, the ability of word vectors to capture semantic relations has enabled a wide range of applications beyond NLP, including flexible querying of relational databases (Bordawekar and Shmueli, 2017), schema matching (Fernandez et al., 2018), completion and retrieval of Web tables (Zhang et al., 2019), ontology completion (Bouraoui and Schockaert, 2019) and information retrieval in the medical domain (Arguello Casteleiro et al., 2020). More generally, relational similarity (or analogy) plays a central role in computational creativity (Goel, 2019), legal reasoning (Ashley, 1988; Walton, 2010), ontology alignment (Raad and Evermann, 2015) and instance-based learning (Miclet et al., 2008).
Given the recent success of pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020), we may wonder whether such models are able to capture lexical relations in a more faithful or fine-grained way than traditional word embeddings. However, for language models (LMs), there is no direct equivalent to the word vector difference. In this paper, we therefore propose a strategy for extracting relation embeddings from pre-trained LMs, i.e. vectors encoding the relationship between two words. On the one hand, this will allow us to gain a better understanding of how well lexical relations are captured by these models. On the other hand, this will also provide us with a practical method for obtaining relation embeddings in applications such as the ones mentioned above.
Since it is unclear how LMs store relational knowledge, rather than directly extracting relation embeddings, we first fine-tune the LM, such that relation embeddings can be obtained from its output. To this end, we need a prompt, i.e. a template to convert a given word pair into a sentence, and some training data to fine-tune the model. To illustrate the process, consider the word pair Paris-France. As a possible input to the model, we could use a sentence such as "The relation between Paris and France is <mask>". Note that our aim is to find a strategy that can be applied to any pair of words, hence the way in which the input is represented needs to be sufficiently generic. We then fine-tune the LM such that its output corresponds to a relation embedding. To this end, we use a crowdsourced dataset of relational similarity judgements that was collected in the context of SemEval 2012 Task 2 (Jurgens et al., 2012). Despite the relatively small size of this dataset, we show that the resulting fine-tuned LM allows us to produce high-quality relation embeddings, as confirmed in our extensive evaluation in analogy and relation classification tasks. Importantly, this also holds for relations that are of a different nature than those in the SemEval dataset, showing that this process allows us to distill relational knowledge that is encoded in the pre-trained LM, rather than merely generalising from the examples that were used for fine-tuning.

Related Work
Probing LMs for Relational Knowledge Since the introduction of transformer-based LMs, a large number of works have focused on analysing the capabilities of such models, covering the extent to which they capture syntax (Goldberg, 2019; Saphra and Lopez, 2019; Hewitt and Manning, 2019; van Schijndel et al., 2019; Jawahar et al., 2019; Tenney et al., 2019), lexical semantics (Ethayarajh, 2019; Bommasani et al., 2020; Vulic et al., 2020), and various forms of factual and commonsense knowledge (Petroni et al., 2019; Forbes et al., 2019; Davison et al., 2019; Zhou et al., 2020; Talmor et al., 2020; Roberts et al., 2020), among others. The idea of extracting relational knowledge from LMs, in particular, has also been studied. For instance, Petroni et al. (2019) use BERT for link prediction. To this end, they use a manually defined prompt for each relation type, in which the tail entity is replaced by a <mask> token. To complete a knowledge graph triple such as (Dante, born-in, ?), they create the input "Dante was born in <mask>" and then look at the predictions of BERT for the masked token to retrieve the correct answer. It is notable that BERT is thus used for extracting relational knowledge without any fine-tuning. This clearly shows that a substantial amount of factual knowledge is encoded in the parameters of pre-trained LMs. Some works have also looked at how such knowledge is stored. Geva et al. (2020) argue that the feed-forward layers of transformer-based LMs act as neural memories, which would suggest that e.g. "the place where Dante was born" is stored as a property of Florence. Dai et al. (2021) present further evidence for this view. What is less clear, then, is whether relations themselves have an explicit representation, or whether transformer models essentially store a propositionalised knowledge graph. The results we present in this paper suggest that common lexical relations (e.g. hypernymy, meronymy, has-attribute), at least, must have some kind of explicit representation, although it remains unclear how they are encoded.
Another notable work focusing on link prediction is (Bosselut et al., 2019), where GPT is fine-tuned to complete triples from commonsense knowledge graphs, in particular ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019). While their model was able to generate new knowledge graph triples, it is unclear to what extent this is achieved by extracting commonsense knowledge that was already captured by the pre-trained GPT model, or whether this rather comes from the ability to generalise from the training triples. For the ConceptNet dataset, for instance, Jastrzębski et al. (2018) found that most test triples are in fact minor variations of training triples. In this paper, we also rely on fine-tuning, which makes it harder to determine to what extent the pre-trained LM already captures relational knowledge. We address this concern by including relation types in our evaluation which are different from the ones that have been used for fine-tuning.

Unsupervised Relation Discovery
Modelling how different words are related is a long-standing challenge in NLP. An early approach is DIRT (Lin and Pantel, 2001), which encodes the relation between two nouns as the dependency path connecting them, taking the view that two such dependency paths are similar if the sets of word pairs with which they co-occur are similar. Hasegawa et al. (2004) cluster named entity pairs based on bag-of-words representations of the contexts in which they appear. Along the same lines, Yao et al. (2011) proposed a generative probabilistic model, inspired by LDA (Blei et al., 2003), in which relations are viewed as latent variables (similar to topics in LDA). Turney (2005) proposed a method called Latent Relational Analysis (LRA), which uses matrix factorization to learn relation embeddings based on co-occurrences of word pairs and dependency paths. Matrix factorization is also used in the Universal Schema approach from Riedel et al. (2013), which jointly models the contexts in which words appear in a corpus together with a given set of relational facts.
The aforementioned works essentially represent the relation between two words by summarising the contexts in which these words co-occur. In recent years, a number of strategies based on distributional models have been explored that rely on similar intuitions but go beyond simple vector operations on word embeddings. For instance, Jameel et al. (2018) introduced a variant of the GloVe word embedding model, in which relation vectors are jointly learned with word vectors. In SeVeN (Espinosa-Anke and Schockaert, 2018) and RELATIVE (Camacho-Collados et al., 2019), relation vectors are computed by averaging the embeddings of context words, while pair2vec (Joshi et al., 2019) uses an LSTM to summarise the contexts in which two given words occur, and Washio and Kato (2018) learn embeddings of dependency paths to encode word pairs. Another line of work is based on the idea that relation embeddings should facilitate link prediction, i.e. given the first word and a relation vector, we should be able to predict the second word (Marcheggiani and Titov, 2016; Simon et al., 2019). This idea also lies at the basis of the approach from Soares et al. (2019), who train a relation encoder by fine-tuning BERT (Devlin et al., 2019) with a link prediction loss. However, it should be noted that they focus on learning relation vectors from individual sentences, as a pre-training task for applications such as few-shot relation extraction. In contrast, our focus in this paper is on characterising the overall relationship between two words.

RelBERT
In this section, we describe our proposed relation embedding model (RelBERT henceforth). To obtain a relation embedding for a given word pair (h, t), we first convert it into a sentence s, called the prompt. We then feed the prompt through the LM and average the contextualized embeddings (i.e. the output vectors) to get the relation embedding of (h, t). These steps are illustrated in Figure 1 and explained in more detail in the following.
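The two steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template is the example prompt given in the introduction, and the token vectors stand in for the contextualized output of an LM such as RoBERTa.

```python
def build_prompt(head: str, tail: str) -> str:
    """Convert a word pair (h, t) into a prompt sentence.

    Uses the example template from the introduction; the paper also
    considers other manual and automatically learned prompts."""
    return f"The relation between {head} and {tail} is <mask>."

def relation_embedding(token_vectors):
    """Average the LM's contextualized token vectors (one per token of the
    prompt) to obtain the relation embedding of the word pair."""
    n, dim = len(token_vectors), len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

For example, `build_prompt("Paris", "France")` yields the sentence that would be fed through the LM, and `relation_embedding` pools whatever output vectors the LM returns for it.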

Prompt Generation
Manual Prompts A basic prompt generation strategy is to rely on manually created templates, which have proven effective in factual knowledge probing (Petroni et al., 2019) and text classification (Schick and Schütze, 2021; Tam et al., 2021; Le Scao and Rush, 2021), among many others. To test whether manually generated templates can be effective for learning relation embeddings, we consider five such templates.

Learned Prompts The choice of prompt can have a significant impact on an LM's performance.
Since it is difficult to generate manual prompts in a systematic way, several strategies for automated generation of task-specific prompts have been proposed, e.g. based on mining patterns from a corpus (Bouraoui et al., 2020), paraphrasing (Jiang et al., 2020), training an additional LM for template generation (Haviv et al., 2021;Gao et al., 2020), and prompt optimization (Shin et al., 2020;Liu et al., 2021). In our work, we focus on the latter strategy, given its conceptual simplicity and its strong reported performance on various benchmarks. Specifically, we consider AutoPrompt (Shin et al., 2020) and P-tuning (Liu et al., 2021). Note that both methods rely on training data. We will use the same training data and loss function that we use for fine-tuning the LM; see Section 3.2.
AutoPrompt initializes the prompt as a fixed-length template, in which the head and tail words are interleaved with trigger tokens z_i; the hyper-parameters π, τ and γ determine how many trigger tokens appear in each part of the template (π + τ + γ in total). The trigger tokens are initialized as <mask>. The method then iteratively finds the best token to replace each mask, based on the gradient of the task-specific loss function.

P-tuning employs the same template initialization as AutoPrompt, but its trigger tokens are newly introduced special tokens with trainable embeddings ê_{1:π+τ+γ}, which are learned using a task-specific loss function while the LM's weights are frozen.
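The fixed-length template initialization can be sketched as below. The exact interleaving of trigger tokens with the head word, tail word and mask is an assumption here (the original template layout is not fully specified in this excerpt); the sketch only illustrates that π + τ + γ triggers, all initialized as <mask>, surround the word pair.

```python
def init_template(head: str, tail: str, pi: int, tau: int, gamma: int):
    """Build a fixed-length prompt template with pi + tau + gamma trigger
    tokens, all initialized as <mask>.  The layout used here (triggers
    before the head, between the words, and after the tail) is one
    plausible arrangement, chosen for illustration."""
    return (["<mask>"] * pi) + [head] + (["<mask>"] * tau) + [tail] + (["<mask>"] * gamma)
```

AutoPrompt would then iteratively replace each `<mask>` with a concrete vocabulary token, whereas P-tuning would keep the slots as special tokens and train their embeddings directly.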

Fine-tuning the LM
To fine-tune the LM, we need training data and a loss function. As training data, we assume that, for a number of different relation types r, we have access to examples of word pairs (h, t) that are instances of that relation type. The loss function is based on the following intuition: the embeddings of word pairs that belong to the same relation type should be closer together than the embeddings of pairs that belong to different relations. In particular, we use the triplet loss from Schroff et al. (2015) and the classification loss from Reimers and Gurevych (2019), both of which are based on this intuition.

Triplet Loss
We draw a triplet from the relation dataset by selecting an anchor pair a = (h_a, t_a), a positive example p = (h_p, t_p) and a negative example n = (h_n, t_n), i.e. we select word pairs a, p, n such that a and p belong to the same relation type while n belongs to a different relation type. Let us write x_a, x_p, x_n for the corresponding relation embeddings. Each relation embedding is produced by the same LM, which is trained to make the distance between x_a and x_p smaller than the distance between x_a and x_n. Formally, this is accomplished using the following triplet loss function:

L_t = max(0, ‖x_a − x_p‖ − ‖x_a − x_n‖ + ε)

where ε > 0 is the margin and ‖·‖ is the l2 norm. (A note on AutoPrompt: in most implementations, the vocabulary from which trigger tokens are sampled is restricted to that of the training data. However, given the nature of our training data, i.e. pairs of words rather than sentences, we consider the full vocabulary of the pre-trained LM.)
Classification Loss Following SBERT (Reimers and Gurevych, 2019), we use a classifier to predict whether two word pairs belong to the same relation. Given the relation embeddings u, v ∈ R^d of the two word pairs, the classifier takes the feature vector u ⊕ v ⊕ |u − v| as input, where |·| denotes the element-wise absolute difference and ⊕ concatenation. The classifier is jointly trained with the LM using the negative log likelihood loss function:

L_c = −y log ŷ − (1 − y) log(1 − ŷ), with ŷ = σ(W(u ⊕ v ⊕ |u − v|))

where W ∈ R^{3d}, σ is the sigmoid function, and y = 1 if the two word pairs have the same relation type and y = 0 otherwise.
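A minimal sketch of this classification loss is given below. The SBERT-style feature construction (u, v and their element-wise absolute difference) follows the text; the binary logistic parameterization is a simplifying assumption, not necessarily the paper's exact classifier head.

```python
import math

def pair_features(u, v):
    """SBERT-style classifier input: concatenation of u, v and |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def classification_loss(u, v, weights, bias, same_relation):
    """Negative log-likelihood of a binary logistic classifier deciding
    whether two word pairs share a relation type (simplified sketch)."""
    z = sum(w * x for w, x in zip(weights, pair_features(u, v))) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return -math.log(p) if same_relation else -math.log(1.0 - p)
```

In the actual model the classifier weights and the LM are optimized jointly, so the gradient of this loss also flows back into the relation embeddings u and v.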

Experimental Setting
In this section, we explain our experimental setting to train and evaluate RelBERT.

RelBERT Training
Dataset We use the platinum ratings from SemEval 2012 Task 2 (Jurgens et al., 2012) as our training dataset for RelBERT. This dataset covers 79 fine-grained semantic relations, which are grouped into 10 categories. For each of the 79 relations, the dataset contains a typicality score for a number of word pairs (around 40 on average), indicating to what extent the word pair is a prototypical instance of the relation. We treat the top 10 pairs (i.e. those with the highest typicality scores) as positive examples of the relation, and the bottom 10 pairs as negative examples. We use 80% of these positive and negative examples for training RelBERT (i.e. learning the prompt and fine-tuning the LM) and 20% for validation.
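The selection and splitting procedure described above can be sketched as follows; the data layout (a list of scored pairs per relation) is an assumption for illustration.

```python
def build_relation_examples(scored_pairs, k=10, train_frac=0.8):
    """scored_pairs: (word_pair, typicality_score) tuples for one SemEval
    relation.  The top-k pairs become positive examples and the bottom-k
    negative examples; each list is then split 80/20 into training and
    validation portions."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    positives, negatives = ranked[:k], ranked[-k:]
    n_train = int(k * train_frac)
    return {
        "train": {"pos": positives[:n_train], "neg": negatives[:n_train]},
        "valid": {"pos": positives[n_train:], "neg": negatives[n_train:]},
    }
```

With k = 10 and an 80/20 split, each relation contributes 8 positive and 8 negative training examples, and 2 of each for validation.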

Constructing Training Triples
We rely on three different strategies for constructing training triples. First, we obtain triples by selecting two positive examples of a given relation type (i.e. from the top-10 pairs) and one negative example (i.e. from the bottom-10 pairs). We construct 450 such triples per relation. Second, we construct triples by using two positive examples of the relation and one positive example from another relation (which is assumed to correspond to a negative example). In particular, for efficiency, we use the anchors and positive examples of the other triples from the same batch as negative examples (while ensuring that these triples are from different relations). Figure 2 illustrates this idea. Note how the effective batch size thus increases quadratically, while the number of vectors that need to be encoded by the LM remains unchanged. In our setting, this leads to an additional 13,500 triples per relation. Similar in-batch negative sampling has been shown to be effective in information retrieval (Karpukhin et al., 2020; Gillick et al., 2019). Third, we also construct training triples by considering the 10 high-level categories as relation types. In this case, we choose two positive examples from different relations that belong to the same category, along with a positive example from a relation in a different category. We add 5,040 triples of this kind for each of the 10 categories.
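The in-batch negative sampling of the second strategy can be sketched as below; the batch layout (one anchor/positive pair per relation) is a simplifying assumption.

```python
def in_batch_triples(batch):
    """batch: (relation_id, anchor_pair, positive_pair) tuples.  The anchors
    and positives belonging to *other* relations in the batch are reused as
    negatives, so the number of triples grows quadratically with the batch
    size while the number of pairs the LM must encode stays unchanged."""
    triples = []
    for rel, anchor, positive in batch:
        for other_rel, other_anchor, other_positive in batch:
            if other_rel == rel:
                continue  # only pairs from a different relation act as negatives
            triples.append((anchor, positive, other_anchor))
            triples.append((anchor, positive, other_positive))
    return triples
```

Every pair in the batch is still encoded exactly once; only the loss is evaluated over the quadratically many (anchor, positive, negative) combinations.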
Training RelBERT training consists of two phases: prompt optimization (unless a manually defined prompt is used) and language model fine-tuning. First, we optimize the prompt over the training set with the triplet loss L_t while the parameters of the LM are frozen. Subsequently, we fine-tune the LM with the resulting prompt, using the sum of the triplet loss L_t and the classification loss L_c over the same training set. We do not use the classification loss during the prompt optimization, as that would involve training the classifier while optimizing the prompt.

We select the best hyper-parameters of the prompting methods based on the final loss over the validation set. In particular, when manual prompts are used, we choose the best template among the five candidates described in Section 3.1. For AutoPrompt and P-tuning, we consider all combinations of π ∈ {8, 9}, τ ∈ {1, 2}, γ ∈ {1, 2}.

We use RoBERTa (Liu et al., 2019) as our main LM, where the initial weights were taken from the roberta-large model checkpoint shared by the Huggingface transformers model hub (Wolf et al., 2020). We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.00002 and batch size 64, and we fine-tune the model for 1 epoch. For AutoPrompt, the top-50 tokens are considered and the number of iterations is set to 50. In each iteration, one of the input tokens is re-sampled and the loss is re-computed across the entire training set. For P-tuning, we train the weights that define the trigger embeddings (i.e. the weights of the input vectors and the parameters of the LSTM) for 2 epochs. Note that we do not tune RelBERT on any task-specific training or validation set. We thus use the same relation embeddings across all the considered evaluation tasks.

Evaluation Tasks
We evaluate RelBERT on two relation-centric tasks: solving analogy questions (unsupervised) and lexical relation classification (supervised).

Analogy Questions
We consider the task of solving word analogy questions. Given a query word pair, the model is required to select the relationally most similar word pair from a list of candidates. To solve this task, we simply choose the candidate whose RelBERT embedding has the highest cosine similarity with the RelBERT embedding of the query pair. Note that this task is completely unsupervised, without the need for any training or tuning. We use the five analogy datasets that were considered by Ushio et al. (2021): the SAT analogies dataset (Turney et al., 2003), the U2 and U4 analogy datasets, which were collected from an educational website, and datasets that were derived from BATS (Gladkova et al., 2016) and the Google analogy dataset (Mikolov et al., 2013b). These five datasets consist of tuning and testing fragments. In particular, they contain 37/337 (SAT), 24/228 (U2), 48/432 (U4), 50/500 (Google), and 199/1799 (BATS) questions for validation/testing. As there is no need to tune RelBERT on task-specific data, we only use the test fragments. For SAT, we will also report results on the full dataset (i.e. the testing fragment and tuning fragment combined), as this allows us to compare the performance with published results. We will refer to this full version of the SAT dataset as SAT†.

Lexical Relation Classification

We consider the task of predicting which relation a given word pair belongs to. To solve this task, we train a multi-layer perceptron (MLP) which takes the (frozen) RelBERT embedding of the word pair as input. We consider several widely used benchmark datasets for this task, including BLESS, K&H+N and EVALution (see the results section). The MLP uses a 100-dimensional hidden layer and is optimized using Adam with a learning rate of 0.001. These datasets focus on the following lexical relations: co-hyponymy (cohyp), hypernymy (hyp), meronymy (mero), possession (poss), synonymy (syn), antonymy (ant), attribute (attr), event, and random (rand).
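The unsupervised analogy-solving rule (choose the candidate pair whose relation embedding is most cosine-similar to the query's) can be sketched as:

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def solve_analogy(query_embedding, candidate_embeddings):
    """Return the index of the candidate word pair whose relation embedding
    is most similar (by cosine) to that of the query pair."""
    return max(range(len(candidate_embeddings)),
               key=lambda i: cosine(query_embedding, candidate_embeddings[i]))
```

No training or tuning is involved: the embeddings are produced by the fixed RelBERT model and only compared at inference time.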

Baselines
As baselines, we consider two standard word embedding models: GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017), where word pairs are represented by the vector difference of their word embeddings (diff). Vector difference is the most common method for encoding relations, and has been shown to be the most reliable in the context of word analogies (Hakami and Bollegala, 2017). For the classification experiments, we also consider the concatenation of the two word embeddings (cat) and their element-wise multiplication (dot). We furthermore experiment with two pre-trained word pair embedding models: pair2vec (Joshi et al., 2019) (pair) and RELATIVE (Camacho-Collados et al., 2019) (rel). For these word pair embeddings, as well as for RelBERT, we concatenate the embeddings from both directions, i.e. (h, t) and (t, h). For the analogy questions, two simple statistical baselines are included: the expected random performance and a strategy based on point-wise mutual information (PMI) (Church and Hanks, 1990). In particular, the PMI score of a word pair is computed using the English Wikipedia, with a fixed window size of 10. We then choose the candidate pair with the highest PMI as the prediction. Note that this PMI-based method completely ignores the query pair. We also compare with the published results from Ushio et al. (2021), where a strategy is proposed to solve analogy questions by using LMs to compute an analogical proportion score. In particular, a four-word tuple (a, b, c, d) is encoded using a custom prompt and perplexity-based scoring strategies are used to determine whether the word pair (a, b) has the same relation as the word pair (c, d).
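The three word-embedding pair encodings (diff, cat, dot) can be sketched as below; the direction of the vector difference is a convention and is assumed here to be t − h.

```python
def encode_pair(h_vec, t_vec, mode="diff"):
    """Word-pair encodings used by the word embedding baselines:
    vector difference (diff), concatenation (cat), or element-wise
    multiplication (dot)."""
    if mode == "diff":
        return [t - h for h, t in zip(h_vec, t_vec)]
    if mode == "cat":
        return h_vec + t_vec
    if mode == "dot":
        return [h * t for h, t in zip(h_vec, t_vec)]
    raise ValueError(f"unknown mode: {mode}")
```

Each encoding produces a fixed-size vector that can be fed to the same downstream classifier or similarity computation as a RelBERT embedding.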
Finally, for the SAT† dataset, we compare with the published results of GPT-3 (Brown et al., 2020), LRA (Turney, 2005) and SuperSim (Turney, 2013); for relation classification, we report the published results of the LexNet (Shwartz et al., 2016) and SphereRE (Wang et al., 2019) relation classification models, taking the results from the latter publication. We did not reproduce these latter methods under the same conditions as our work, and hence they are not fully comparable. Moreover, these approaches are of a different nature, as the aim of our work is to provide universal relation embeddings rather than task-specific models.

Results
In this section, we present our main experimental results, testing the relation embeddings learned by RelBERT on analogy questions (Section 5.1) and relation classification (Section 5.2).

Analogy Questions

Table 2 shows the accuracy on the analogy benchmarks. The RelBERT models substantially outperform the baselines on all datasets, except for the Google analogy dataset, which has been shown to be biased toward word similarity and therefore to be well suited to word embeddings (Linzen, 2016; Rogers et al., 2017). Comparing the different prompt generation approaches, we can see that, surprisingly, the manual prompt consistently outperforms the automatically learned prompt strategies.

On SAT†, RelBERT outperforms LRA, which represents the state of the art in the zero-shot setting, i.e. the setting where no training data from the SAT dataset is used. RelBERT moreover outperforms GPT-3 in the few-shot setting, despite not using any training examples. In contrast, GPT-3 encodes a number of training examples as part of the prompt. It can furthermore be noted that the other two relation embedding methods (i.e. pair2vec and RELATIVE) perform poorly in this unsupervised task. The analogical proportion score from Ushio et al. (2021) also underperforms RelBERT, even when tuned on dataset-specific tuning data.

Lexical Relation Classification

Table 3 summarizes the results of the lexical relation classification experiments, in terms of macro and micro averaged F1 score. The RelBERT models achieve the best results on all datasets except for BLESS and K&H+N, where the performance of all models is rather close. We observe a particularly large improvement over the word embedding and SotA models on the EVALution dataset. When comparing the different prompting strategies, we again find that the manual prompts perform surprisingly well, although the best results are now obtained with learned prompts in a few cases.

Analysis
To better understand how relation embeddings are learned, in this section we analyze the model's performance in more detail.

Training Data Overlap
In our main experiments, RelBERT is trained using the SemEval 2012 Task 2 dataset. This dataset contains a broad range of semantic relations, including hypernymy and meronymy relations. This raises an important question: does RelBERT provide us with a way to extract relational knowledge from the parameters of the pre-trained LM, or is it learning to construct relation embeddings from the triples in the training set? What is of particular interest is whether RelBERT is able to model types of relations that it has not seen during training. To answer this question, we conduct an additional experiment in which we evaluate RelBERT on lexical relation classification, using a version that was trained without the relations from the Class Inclusion category, i.e. the high-level category in the SemEval dataset that includes the hypernymy relation. Hypernymy is of particular interest, as it can be found across all the considered lexical relation classification datasets, which is itself a reflection of its central importance in lexical semantics. In Table 4, we report the difference in performance compared to the original RelBERT model (i.e. the model that was fine-tuned on the full SemEval training set). As can be seen, the overall changes in performance are small, and the new version actually outperforms the original RelBERT model on a few datasets. In particular, hypernymy is still modelled well, confirming that RelBERT is able to generalize to unseen relations.

As a further analysis, Table 5 shows a breakdown of the Google and BATS analogy results, showing the average performance on each of the top-level categories from these datasets. While RelBERT is outperformed by FastText on the morphological relations, it should be noted that the differences are small, and that such relations are of a very different nature than those from the SemEval dataset. This confirms that RelBERT is able to model a broad range of relations, although it can be expected that better results would be possible by including task-specific training data in the fine-tuning process (e.g. including morphological relations for tasks where such relations matter).

Figure 3 compares the performance of RelBERT with that of the vanilla pre-trained RoBERTa model (i.e. when only the prompt is optimized). As can be seen, the fine-tuning process is critical for achieving good results.

Qualitative Analysis
To give further insight into the nature of RelBERT embeddings, Table 6 shows the nearest neighbors of some selected word pairs from the evaluation datasets. To this end, we computed RelBERT relation vectors for all pairs in the vocabulary of the RELATIVE model pre-trained on Wikipedia (over 1M pairs). The neighbors are those word pairs whose RelBERT embedding has the highest cosine similarity within the full pair vocabulary. As can be seen, the neighbors mostly represent word pairs that are relationally similar, even for morphological relations (e.g. dog:dogs), which are not present in the SemEval dataset. A more extensive qualitative analysis, including a comparison with RELATIVE, is provided in the appendix.

Conclusion
We have proposed a strategy for learning relation embeddings, i.e. vector representations of pairs of words which capture their relationship. The main idea is to fine-tune a pre-trained language model using the relational similarity dataset from SemEval 2012 Task 2, which covers a broad range of semantic relations. In our experimental results, we found the resulting relation embeddings to be of high quality, outperforming state-of-the-art methods on several analogy and relation classification benchmarks. Among the models tested, we obtained the best results with RoBERTa, when using manually defined templates for encoding word pairs. Importantly, we found that high-quality relation embeddings can be obtained even for relations that are unlike those from the SemEval dataset, such as morphological and encyclopedic relations. This suggests that the knowledge captured by our relation embeddings is largely distilled from the pre-trained language model, rather than being acquired during training.

A Additional Experimental Results
In this section, we show additional experimental results that complement the main results of the paper.

A.1 Vanilla LM Comparison
We compare versions of RelBERT with an optimized prompt, with and without fine-tuning. Figure 4 shows the absolute accuracy drop from RelBERT (i.e. the model with fine-tuning) to the vanilla RoBERTa model (i.e. without fine-tuning) using the same prompt. In all cases, the accuracy drop for the models without fine-tuning is substantial.

A.2 Comparison with ALBERT & BERT
We use RoBERTa in our main experiments; here we instead train RelBERT with ALBERT and BERT, and evaluate the resulting models on both the analogy and relation classification tasks. Table 7 shows the accuracy on the analogy questions, while Table 8 shows the accuracy on the relation classification task. In both tasks, we can confirm that RoBERTa achieves the best performance among the LMs, by a relatively large margin in most cases.

Figure 4: Test accuracy drop of the vanilla models without fine-tuning (measured in terms of absolute percentage points in comparison with RelBERT) on analogy datasets.

Table 9 shows additional results of word embeddings on the analogy test, together with the RelBERT results. We concatenate the RELATIVE and pair2vec vectors with the word vector difference; however, this does not lead to better results.

B Experimental Details and Model Configurations
In this section, we describe the model configurations used in our experiments, along with details of RelBERT's training time. Table 10 shows the best prompt configuration, based on the validation loss for the SemEval 2012 Task 2 dataset, in our main experiments using RoBERTa. Table 11 shows the best hyper-parameters, selected on the validation set, of the MLPs used for relation classification.

B.3 Training Time
Training a single RelBERT model with a custom prompt takes about half a day on two V100 GPUs. Additionally, obtaining a prompt with the AutoPrompt technique takes about a week on a single V100, while P-tuning takes 3 to 4 hours, also on a single V100.

C Implementation Details of AutoPrompt
All the trigger tokens are initialized as mask tokens and updated based on the gradient of the loss function L_t. Concretely, let us denote the loss value with template T as L_t(T). The candidate set for the j-th trigger is derived by

Ṽ_j = top-k_{w ∈ V} ( −e_w^⊤ ∇_{e_{z_j}} L_t(T) )

where the gradient is taken with respect to the j-th trigger token, e_w is the input embedding of word w, and V is the LM's vocabulary. Then we evaluate each candidate token based on the loss function as

w* = argmin_{w ∈ Ṽ_j} L_t(rep(T, j, w))

where rep(T, j, w) replaces the j-th token in T by w, and j is chosen at random in each iteration. We ignore any candidates that do not improve the current loss value, to further enhance the prompt quality.
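The candidate-evaluation step can be sketched as the greedy update below. Gradient-based generation of the candidate set is omitted here (it requires access to the LM's embedding matrix); the sketch only shows how a candidate replacement for the j-th trigger is kept when, and only when, it improves the current loss.

```python
def refine_trigger(template, j, candidates, loss_fn):
    """Greedy AutoPrompt-style update of the j-th trigger token: try each
    candidate replacement and keep it only if it lowers the current loss,
    ignoring candidates that do not improve it."""
    best_token, best_loss = template[j], loss_fn(template)
    for token in candidates:
        trial = template[:j] + [token] + template[j + 1:]
        trial_loss = loss_fn(trial)
        if trial_loss < best_loss:
            best_token, best_loss = token, trial_loss
    return template[:j] + [best_token] + template[j + 1:], best_loss
```

Here `loss_fn` stands in for L_t evaluated over the training set with the given template; in the actual procedure the position j is re-sampled at each iteration.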

D Additional Analysis
In this section, we analyze our experimental results based on a prediction breakdown and provide an extended qualitative analysis.

Table 9: Test accuracy (%) on analogy datasets (SAT† refers to the full SAT dataset).

D.2 Qualitative Analysis
Table 5 shows the nearest neighbors of a number of selected word pairs, in terms of their RelBERT and RELATIVE embeddings. In both cases, cosine similarity is used to compare the embeddings, and the pair vocabulary of the RELATIVE model is used to determine the universe of candidate neighbors.
The results for the RelBERT embeddings show their ability to capture a wide range of relations. In most cases the neighbors make sense, despite the fact that many of these relations are quite different from those in the SemEval dataset that was used for training RelBERT. The results for RELATIVE are in general much noisier, suggesting that RELATIVE embeddings fail to capture many types of relations. This is in particular the case for the morphological examples, although various issues can be observed for the other relations as well.