Probing Relational Knowledge in Language Models via Word Analogies

Understanding relational knowledge plays an integral part in natural language understanding. When it comes to pre-trained language models (PLMs), prior work has focused on probing relational knowledge by filling the blanks in pre-defined prompts such as "The capital of France is -". However, these probes may be affected by the co-occurrence of target relation words and entities (e.g. "capital", "France" and "Paris") in the pre-training corpus. In this work, we extend these probing methodologies by leveraging analogical proportions as a proxy to probe relational knowledge in transformer-based PLMs without directly presenting the desired relation. In particular, we analyse the ability of PLMs to understand (1) the directionality of a given relation (e.g. Paris-France is not the same as France-Paris); (2) the types involved in a given relation (both France and Japan are countries); and (3) the relation itself (Paris, but not Rome, is the capital of France). Our results show that PLMs are extremely accurate at (1) and (2), but have room for improvement for (3). To better understand the reasons behind this behaviour and the types of mistake made by PLMs, we provide an extended quantitative analysis.


Introduction
A major area of research in NLP in the past years has been devoted to probing pre-trained language models (PLMs) to measure the extent to which relational/factual knowledge is captured by their representations (Bouraoui et al., 2020; Jiang et al., 2020; Wallat et al., 2020). Seeking insight into the hidden representational space of PLMs, recent studies have exploited the word prediction capabilities of language models in a cloze-style fill-the-blank configuration as a more direct method to probe factual knowledge in PLMs (Petroni et al., 2019; Wallat et al., 2020). For example, in order to query for the capital of France, one can use the PLM to fill the blank in a prompt such as 'The capital of France is -' by predicting the most probable word as a response.
This approach has limitations, however. While some studies have attempted to find templates automatically to reduce the reliance on specific prompts (Shin et al., 2020; Liu et al., 2021), it has also been shown that PLMs may fail to understand simple features such as negation (Kassner and Schütze, 2020). Indeed, PLMs may even be biased towards the words in the prompt, and base their answers on this or other confounds (e.g. co-occurrences) instead of understanding the relation itself. For instance, in the previously mentioned prompt 'The capital of France is -', the word 'capital' already appears in the prompt.
To address this, one possible solution is to rely on word analogies. In order to study the existence of a target relation between a pair of words (e.g. the 'country:capital' relation between Paris and France), one could simply put them together with another pair holding the target relation (e.g. Rome and Italy) in an analogy sentence (e.g. 'Paris to France is like Rome to Italy') and probe the model in a binary classification setting to check whether the analogy holds. Taking word analogies as the main reference, our aim is therefore to understand whether transformer-based language models hold sufficient relational information to classify an analogy as being true or false. While we know from previous work that PLMs are indeed able to solve various types of analogies (Ushio et al., 2021b), we are interested in analyzing three different aspects, for which we propose three probe tasks. In short, these probes are aimed at understanding the PLMs' capability for (1) making fine-grained distinctions between concepts within the same types; (2) capturing the directionality of unidirectional relations; and (3) distinguishing types, such as the difference [...]

Methodology
In this section, we describe our methodology to probe relational knowledge from language models through word analogies. Given two different tuples (h_1, t_1) and (h_2, t_2), and an analogy template T, we can generate an analogy sentence by inserting the head and tail words of the tuples into their respective positions in T. An analogy sentence holds with respect to relation R if both tuples used in the generation process are members of the relation set R. For instance, the tuples (Paris, France) and (Rome, Italy) form a correct analogy for the 'country:capital' relation.
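The sentence construction described above can be sketched in a few lines of Python. This is only an illustration: the template wording is taken from the example in the introduction, while the function names `analogy_sentence` and `holds` and the `{h1}`-style slots are assumptions of this sketch, not part of the original method description.

```python
# Template wording follows the introductory example; the slot names are hypothetical.
TEMPLATE = "{h1} to {t1} is like {h2} to {t2}"

def analogy_sentence(pair_a, pair_b, template=TEMPLATE):
    """Insert the head and tail of both tuples into the template."""
    (h1, t1), (h2, t2) = pair_a, pair_b
    return template.format(h1=h1, t1=t1, h2=h2, t2=t2)

def holds(pair_a, pair_b, relation):
    """An analogy holds w.r.t. R only if both tuples are members of R."""
    return pair_a in relation and pair_b in relation

# Toy relation set with two tuples.
relation = {("Paris", "France"), ("Rome", "Italy")}

print(analogy_sentence(("Paris", "France"), ("Rome", "Italy")))
print(holds(("Paris", "France"), ("Rome", "Italy"), relation))
print(holds(("Paris", "France"), ("Italy", "France"), relation))
```

The membership check is the ground-truth label; the probes then ask whether a PLM can recover this label from the sentence alone.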

Probe Evaluation Setting
For our probes, we rely on two distinct settings which are usually linked to different usages of the language models, namely supervised and unsupervised (or zero-shot). The starting point for these two probe settings is a relation R = {(h_1, t_1), (h_2, t_2), (h_3, t_3), ...} where (h_i, t_i) represents a tuple belonging to the given relation.

Supervised setting. Having the relation set R as input, we can frame the task as a binary classification where the input is a tuple and the output is True or False depending on whether the tuple belongs to the relation or not. For example, Paris-France would be a positive example for the capital-of relation, while Paris-Rome would be a negative example. In turn, R can be easily split between training and test sets, and negative samples can be obtained in different ways depending on the actual probe.
Unsupervised setting. In this setting, similarly obtained negative tuples can be paired with positive tuples. In this case, given an input pair (h_i, t_i) from R, and a pair of two additional tuples (a positive and a negative example), the task consists of identifying the tuple that better represents the relation R. For instance, given Paris-France and the tuples Rome-Italy and Italy-France as possible options, the correct answer would be Rome-Italy.
These two settings (i.e., supervised and unsupervised) provide additional insights in relation to two distinct theories with respect to relational knowledge. The supervised binary classification setting corresponds to the rigid theory in which relations either hold or not (Beckwith et al., 1991), present in resources such as WordNet (Miller, 1995). Stemming from cognitive psychology, the unsupervised comparative evaluation setting reflects the graded assumption in which relations are more fluid and it is not always possible to provide a clear binary distinction (Rosch, 1973). This comparative setting has been the basis for constructing graded relational datasets (Vulić et al., 2017).

Probes
We propose three probes to understand how language models capture three different aspects of relational knowledge. The probes differ mainly in how they construct negative samples, in order to test various features. The input in all cases is any given (h_i, t_i) in R. Figure 1 lists some sample positive and negative pairs for the three probes.
Random replacement. There are two ways to construct negative samples for this probe. A negative sample consists of a tuple in R paired with an auxiliary tuple that is either (1) (h_j, t_i) in which h_j ≠ h_i (we refer to this probe as random-head); or (2) (h_i, t_j) in which t_j ≠ t_i (random-tail). This probe aims at understanding how hard it is for the language model to identify a relation when the types of the head and tail are maintained.
Reverse direction. For the negative samples, the auxiliary tuple is simply a random tuple from R in which the positions of head and tail are reversed, so that it follows the form (t_i, h_i). This probe aims at understanding to what extent PLMs understand the directionality of a given relation.
Type. For each input (h_i, t_i), as negative examples we take any two tuples [...]. The goal of this probe is to test the capability of PLMs to understand the types of a given relation and, specifically, that different types are required for the relation to hold.
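The negative-sample constructions for the three probes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, and the exact tuple form used by the type probe is not recoverable from the text, so `same_type` assumes it pairs two heads or two tails (in the spirit of the Italy-France example given for the unsupervised setting).

```python
import random

def random_head(pair, relation, rng):
    """Random-head negative: (h_j, t_i) with h_j != h_i."""
    h_i, t_i = pair
    h_j = rng.choice([h for h, _ in relation if h != h_i])
    return (h_j, t_i)

def random_tail(pair, relation, rng):
    """Random-tail negative: (h_i, t_j) with t_j != t_i."""
    h_i, t_i = pair
    t_j = rng.choice([t for _, t in relation if t != t_i])
    return (h_i, t_j)

def reverse_direction(relation, rng):
    """Reverse negative: a random tuple from R with head and tail swapped."""
    h, t = rng.choice(relation)
    return (t, h)

def same_type(pair, relation, rng):
    """Type negative. ASSUMPTION: both words share a type (two heads or two tails)."""
    h_i, t_i = pair
    h_j, t_j = rng.choice([p for p in relation if p != pair])
    return rng.choice([(h_j, h_i), (t_j, t_i)])

# Toy relation set; a fixed seed keeps the sketch reproducible.
relation = [("Paris", "France"), ("Rome", "Italy"), ("Tokyo", "Japan")]
rng = random.Random(0)
query = ("Paris", "France")

print(random_head(query, relation, rng))
print(random_tail(query, relation, rng))
print(reverse_direction(relation, rng))
print(same_type(query, relation, rng))
```

Note that `random_head` and `random_tail` only enforce the inequality stated in the text; in one-to-many or many-to-many relations the sampled tuple may still be a valid member of R.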

Datasets
We opted for the relation sets in the Bigger Analogy Test Set (BATS) dataset (Gladkova et al., 2016). BATS has been shown to be more robust and complete than other analogy datasets such as Google analogies (Mikolov et al., 2013). BATS covers a collection of 40 different inflectional and derivational morphology, lexicographic and encyclopedic relation sets. For each of these sets, we create six distinct datasets, one for each setting and probe.² In the supervised setting, all datasets are initially split into training and test, with half of the tuples in each partition. Then, having an initial train/test partition R of n tuples, we can generate n × (n − 1) positive samples by pairing each tuple with every other tuple. In order to keep the dataset balanced, we also generate the same number of negative samples by following the probe methodologies described in the previous subsection. To form instances in the unsupervised dataset, we simply add a negative example to each positive instance from the supervised datasets.
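As a worked example of the n × (n − 1) count, the sketch below builds a balanced supervised partition from ordered pairs of tuples. The partition size of 25 assumes 50 tuples per BATS relation split in half, which is an assumption not stated in the text; `build_supervised` and the stand-in negative generator `reverse` are likewise hypothetical names used only for illustration.

```python
from itertools import permutations

def build_supervised(tuples, make_negative):
    """Pair each tuple with every other tuple, then add one negative per positive."""
    positives = [(a, b, True) for a, b in permutations(tuples, 2)]
    negatives = [(a, make_negative(b), False) for a, b, _ in positives]
    return positives + negatives

def reverse(pair):
    """Stand-in negative generator: swap head and tail."""
    h, t = pair
    return (t, h)

# ASSUMPTION: 25 tuples per partition (half of a 50-tuple relation set).
partition = [(f"h{i}", f"t{i}") for i in range(25)]
dataset = build_supervised(partition, reverse)

n = len(partition)
print(n * (n - 1))   # number of positive samples
print(len(dataset))  # positives plus an equal number of negatives
```

Under these assumptions, each partition of a relation yields 25 × 24 = 600 positive and 600 negative samples.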

Probe Architecture
For our supervised probe, we opted for RoBERTa-large and RoBERTa-base (Liu et al., 2019) as the PLMs in our experiments. In order to feed our dataset samples to these models, we make use of the analogy template "What - is to -, - is to -.", which was shown to be the most reliable general-purpose prompt for modelling analogies in Ushio et al. (2021b). We then extract the embeddings of the target words and concatenate them to form a larger feature vector, which is ultimately fed into a simple multi-layer perceptron binary classifier.
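The feature construction for the supervised probe can be illustrated with plain Python lists standing in for contextual embeddings. This is a sketch under assumptions: the text does not say how words split into several subword tokens are handled, so mean pooling is assumed here, and the names `mean_pool` and `analogy_features` are hypothetical.

```python
def mean_pool(subword_vectors):
    """ASSUMPTION: a word's embedding is the mean of its subword vectors."""
    dim = len(subword_vectors[0])
    count = len(subword_vectors)
    return [sum(vec[k] for vec in subword_vectors) / count for k in range(dim)]

def analogy_features(word_subword_vectors):
    """Concatenate the pooled embeddings of the target words into one vector."""
    features = []
    for vectors in word_subword_vectors:
        features.extend(mean_pool(vectors))
    return features

# Toy 3-dimensional vectors for four target words; the second word has two subwords.
words = [
    [[1.0, 0.0, 0.0]],
    [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]],
    [[0.0, 0.0, 1.0]],
    [[1.0, 1.0, 1.0]],
]
features = analogy_features(words)
print(len(features))
print(features[3:6])
```

If all four target words are concatenated, the classifier input would have 4 × 768 = 3072 dimensions for RoBERTa-base and 4 × 1024 = 4096 for RoBERTa-large.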
For the unsupervised setting, we pick the option tuple that results in the analogy sentence with the lowest pseudo-perplexity when inserted into our analogy template together with the query tuple. Given the tokenized form [w_1, w_2, ..., w_|S|] of a sentence S, pseudo-perplexity is defined as:

$$\mathrm{PPPL}(S) = \exp\left(-\frac{1}{|S|}\sum_{i=1}^{|S|} \log P\left(w_i \mid S_{\setminus i}\right)\right)$$

in which P(w_i | S_\i) is the pseudo-likelihood (Wang and Cho, 2019) and S_\i is the tokenized form of S where the i-th token is replaced with a <mask> token.

² Datasets are available in the supplementary material.
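The selection rule for the unsupervised setting can be sketched as below. The sketch assumes the per-token log-probabilities log P(w_i | S_\i) have already been obtained from a masked language model; the numbers used here are made up for illustration, and the function names are hypothetical.

```python
import math

def pseudo_perplexity(token_log_probs):
    """exp of the negative mean masked-token log-probability of a sentence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pick_option(option_log_probs):
    """Return the option whose analogy sentence has the lowest pseudo-perplexity."""
    return min(option_log_probs, key=lambda name: pseudo_perplexity(option_log_probs[name]))

# Made-up log-probabilities for the two analogy sentences built from each option.
options = {
    "Rome-Italy": [math.log(0.5), math.log(0.5)],
    "Italy-France": [math.log(0.1), math.log(0.2)],
}

for name, log_probs in options.items():
    print(name, round(pseudo_perplexity(log_probs), 3))
print(pick_option(options))
```

In practice each log-probability requires one forward pass with the corresponding token masked, so scoring a sentence of |S| tokens costs |S| forward passes.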

Probe Evaluation
In this section, we present the results of our probe evaluation based on the methodology described in the previous section. First, we describe the embedding-based baselines in Section 3.1, and then we present the experimental results in Section 3.2.

Baselines
In order to put our results into perspective, we perform experiments with two embedding-based baselines, one using relation embeddings and one using word embeddings. As the relation embedding model, we compare with RelBERT (Ushio et al., 2021a), a model specifically trained to extract relation embeddings from language models. Since RelBERT does not require the input tuples to be in a context, we can use the tuples without any analogy template. In the case of the unsupervised experiments, we extract the RelBERT relation embeddings of the input and candidate tuples, and choose the candidate tuple whose embedding has the highest cosine similarity to that of the input tuple. For the supervised setting, we simply feed the concatenation of the embedding vectors to a multi-layer perceptron binary classifier.
Similarly, we also report the results of a simple fastText-based (Bojanowski et al., 2016) static word embedding baseline. For this baseline, the relation embedding is obtained by simply computing the difference of the individual word embeddings in a tuple, which is the standard pair encoding method used in the literature (Weeds et al., 2014; Vylomova et al., 2016; Camacho-Collados et al., 2019). Once this pair embedding is obtained, the rest of the methodology is the same as the one described for RelBERT.
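The word-embedding baseline in the unsupervised setting can be sketched as follows, with hand-made two-dimensional vectors standing in for fastText embeddings. The vectors, the direction of the difference (tail minus head) and the function names are assumptions of this sketch.

```python
import math

def pair_embedding(head_vec, tail_vec):
    """Relation embedding as the difference of the two word vectors (tail - head)."""
    return [t - h for h, t in zip(head_vec, tail_vec)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def pick_candidate(query, candidates, vectors):
    """Choose the candidate tuple whose pair embedding is closest to the query's."""
    q = pair_embedding(vectors[query[0]], vectors[query[1]])
    return max(
        candidates,
        key=lambda c: cosine(q, pair_embedding(vectors[c[0]], vectors[c[1]])),
    )

# Made-up toy vectors, not real fastText embeddings.
vectors = {
    "Paris": [1.0, 0.0], "France": [1.0, 1.0],
    "Rome": [2.0, 0.0], "Italy": [2.0, 1.0],
}
query = ("Paris", "France")
candidates = [("Rome", "Italy"), ("Italy", "France")]
print(pick_candidate(query, candidates, vectors))
```

The same selection step applies to RelBERT, with its relation embedding taking the place of the vector difference.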

Results
Table 1 shows our main experimental results. At first glance, semantic relations appear to be harder than morphological ones. When analysing model size, the larger RoBERTa model consistently outperforms its smaller counterpart, which is in line with general language modelling results and, in particular, with results on modelling relations (Petroni et al., 2019). Regarding the supervised experiments, RoBERTa performs better in the reverse and type probes than in the random probe. This indicates the ability of PLMs (and of distributional models in general, given the strong fastText-based results) to capture word categories and their direction, while leaving room for improvement when it comes to capturing the more fine-grained distinctions required by the random probe.
In the unsupervised experiments, the difference between relations is less marked in the case of PLMs. In this setting, except for the random probe, a simple word embedding baseline such as fastText proves more reliable. The superior performance on the reverse and type probes compared to random is more pronounced in the case of the fastText baseline. This suggests that PLMs can capture more fine-grained meaning variations than static embeddings. Moreover, the comparable performance of fastText to the best performing models in the unsupervised reverse and random probes indicates that the contextual information encoded in contextualized embeddings, as opposed to the type/category information, plays a less important role in these probing configurations.

Analysis
Since the goal of this paper is to probe PLMs for relational knowledge, for this extended analysis we focus on the supervised setting of the plain RoBERTa-large, which is more in line with most downstream applications.
Word frequency. First, we estimated the number of word occurrences in the underlying pre-training corpora and, following Chiang et al. (2020), we took the harmonic mean of the occurrences of the words in an output tuple as an estimate of tuple occurrence frequency. We hypothesized that most of the errors may be produced when the frequency of the output pair is low, as RoBERTa may be less familiar with the words themselves. To this end, we computed a Kolmogorov-Smirnov test for each relation type, in which we separated the instances by correct and wrong decisions made by the model in the 'random' probe. For most relation types (over two thirds), p-values are higher than 0.05, so we find no evidence that frequency plays a significant role in the performance. The relation types where the effect seems more significant are male-female and adj:comparative.

Table 2 shows a breakdown of the results by relation type. In general, we can observe poor performance on one/many-to-many type relations in 'random' and on bidirectional relations in 'reverse'. This is an interesting sanity check, and one we would expect given the nature of the probes. In order to disentangle the effect that these relations may have on the final performance, we also computed the average accuracy excluding all one/many-to-many and bidirectional relation types. The main conclusions from Section 3.2 still hold: the 'random' probe appears to be harder than the others, with the average overall accuracy being 68.5 ('random'), 98.2 ('reverse') and 95.6 ('type'). Another interesting finding is the consistently superior performance on 'reverse' compared to 'type'. This is true in particular for the relations where head and tail words come from closer general categories (e.g. meronyms:part), which indicates that merely relying on word types may not be enough to capture directionality.
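The two quantities used in the frequency analysis can be sketched with the standard library alone. The word counts and frequency lists below are made up for illustration, and the function names are hypothetical; the sketch computes only the two-sample Kolmogorov-Smirnov statistic, since a p-value would need an additional routine such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right
from statistics import harmonic_mean

def tuple_frequency(head_count, tail_count):
    """Harmonic mean of the two word counts, used as a tuple-frequency estimate."""
    return harmonic_mean([head_count, tail_count])

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Made-up tuple frequencies for correctly and wrongly classified instances.
correct = [120.0, 340.0, 560.0, 900.0]
wrong = [100.0, 300.0, 610.0, 880.0]

print(round(tuple_frequency(10, 1000), 3))
print(ks_statistic(correct, wrong))
```

The harmonic mean is dominated by the rarer word of the pair, so a tuple is treated as infrequent whenever either of its words is infrequent.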

Conclusion
In this paper, we have presented three probes to understand to what extent PLMs (or any model in general) capture different aspects of relations. In general, the 'random' probe, which is aimed at capturing fine-grained information between the different types in a relation, proves the most challenging for PLMs. In contrast, these models can accurately capture the aspects related to directionality and the word categories (or types) involved in a relation. When investigating the reasons for this discrepancy, we did not find a clear correlation between word frequency and performance for the 'random' probe, except for specific relation types. In general, however, based on our unsupervised experiments, PLMs seem to be better equipped to solve this probe when comparing between different pairs, even without task-specific training data.

Limitations
Our experiments are limited in various respects. First, the only language analysed is English, which limits the conclusions that can be drawn with respect to other languages, especially those that are structurally different and from different families. Second, our experiments are based on a limited number of both models (which can additionally vary in size with potentially different conclusions) and configurations/prompts. While we follow standard practice, there are potentially configurations that have not been explored and could alter the significance of the results. Third, word analogies have been shown by previous research to be prone to external biases or confounding factors that can alter the results (Linzen, 2016; Gladkova et al., 2016; Nissim et al., 2020). We minimized this impact by proposing clear binary classification and comparative tasks, instead of the usual predictive framing in word analogies. Fourth, the data utilised corresponds to a single dataset, i.e. BATS. While this dataset was constructed so that a wide variety of relations are covered, these are still limited in number (40) and biased towards certain categories. All in all, our study can be considered a first attempt to probe relational knowledge through word analogies, which appears to be a promising area for future work.

Figure 1: Sample positive and negative pairs for the three proposed probes given Paris-France as input.

Table 1: Average accuracy results of comparison models on our probe datasets grouped by general relation types (ES: Encyclopedic Semantics, DM: Derivational Morphology, LS: Lexicographic Semantics, IM: Inflectional Morphology). The last row includes the overall averaged results for all relation types.

Table 2: RoBERTa-large results grouped by relation type (supervised setting). The results of one/many-to-many relations on the random probe and bidirectional relations on the reverse probe are marked by * and †, respectively.