Bridging Perception, Memory, and Inference through Semantic Relations

There is a growing consensus that surface form alone does not enable models to learn meaning and gain language understanding. This warrants an interest in hybrid systems that combine the strengths of neural and symbolic methods. We favour triadic systems consisting of neural networks, knowledge bases, and inference engines. The network provides perception, that is, the interface between the system and its environment. The knowledge base provides explicit memory and thus immediate access to established facts. Finally, inference capabilities are provided by the inference engine which reflects on the perception, supported by memory, to reason and discover new facts. In this work, we probe six popular language models for semantic relations and outline a future line of research to study how the constituent subsystems can be jointly realised and integrated.


Introduction
Recent works (Bender and Koller, 2020; Bender et al., 2021) postulate that it is impossible to learn meaning from surface form alone, and express concerns about what is perceived as an over-reliance on large-scale pretrained neural networks. This line of thought supports the interest in hybrid systems that amalgamate elements from complementary learning paradigms (see, e.g., Pearl, 2019; Wang et al., 2019; Hohenecker and Lukasiewicz, 2020; van Bekkum et al., 2021). In (Dahlgren et al., 2021), we argue that this calls for an explicit distinction to be made between the faculties of perception, memory, and inference. We therefore promote the development of systems that consist of subsystems with responsibilities corresponding to the three faculties. Such future systems would thus consist of a perception component realised by a neural network, a component that provides explicit memory in the form of a knowledge base, and a third one performing symbolic inference, that is, rule-based reasoning. We suggest studying how the subsystems can be aligned so as to allow a seamless flow of information between them. We view it as particularly important that (i) the network and the knowledge base together yield a consistent treatment of semantic relations and (ii) training takes the knowledge base into account, so that the resulting embeddings are consistent with established facts. Our conceptual discussion is complemented by a preliminary empirical evaluation of six popular English language models, which we subject to linear probes to test their ability to capture central semantic relations.

* The authors are given in alphabetical order. † The work of this author is partially funded by the Swedish Research Council, Grant number 2020-03852, and the Wallenberg Autonomous Systems and AI program. ‡ Contact Author
After a brief discussion of related work in Section 2, Section 3 discusses the role of semantic relations in the context of our envisioned triadic system. Sections 4 and 5 then complement the conceptual discussion with a preliminary empirical evaluation of the prospects of achieving (i), by probing six popular language models on a semantic relation learning task.

Related work
There is a rapidly growing literature on relation extraction and hybrid systems. Petroni et al. (2019) observe that language models such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) are imprinted with large amounts of common sense and factual knowledge during training. If this information can be reliably extracted then, they argue, word embeddings could find a new use as knowledge bases. To test the practicality of this approach, they consider a knowledge extraction task where a language model is given a sentence containing a subject word x and a relation R, but where the object word y has been removed, and the model should guess the missing y (i.e., rank the vocabulary words) based on the fact that x and y are in the relation R. The sentences are generated from manually constructed templates, one per relation. For example, for the relation birth-place, they use the template "[subject] was born in [blank]" and instantiate it to "Dante was born in [blank]". The most important baselines are two variations of the relation extraction model by Sorokin and Gurevych (2017). Key findings are that language models appear to be better at learning one-to-one relations, whereas the relation extraction models are better at picking out many-to-many relations. Petroni et al. also find that the choice of template has an impact on the performance of the language models, and point this out as an item for future work. Bouraoui et al. (2020) pick up this thread and propose a method for extracting good template sentences from BERT, and using these to fine-tune BERT so as to improve its performance on relation extraction. For a target binary relation R (represented as a set of ordered pairs) and a sample of pairs R′ ⊆ R, they filter the training data for sentences expressing that x and y, with (x, y) ∈ R′, are in the relation R, and which would still be natural if x and y were simultaneously replaced by some other (x′, y′) ∈ R′.
Finally, they fine-tune a language model to predict, from an instantiation of one of the remaining sentences with a pair (x′, y′), whether (x′, y′) ∈ R. The most relevant aspect of this work for the present effort is the evaluation on the Bigger Analogy Test Set (also known as BATS), which contains 40 relations with 50 instances per relation (Gladkova et al., 2016). Bouraoui et al. (2020) report mixed performance on the type of semantic relations considered here, namely hypernyms and hyponyms.
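The cloze-style extraction task of Petroni et al. (2019) can be sketched as a simple template instantiation: a per-relation template is filled with the subject, and the object slot is masked for the language model to rank vocabulary words. The template strings and relation names below are illustrative stand-ins, not the originals from their data set.

```python
# Sketch of the LAMA-style cloze task: one manually written template per
# relation; the object position is masked and a masked language model
# would rank candidate words for the blank. Templates are illustrative.

TEMPLATES = {
    "birth-place": "[X] was born in [Y]",
    "capital-of": "[Y] is the capital of [X]",
}

def make_cloze(relation: str, subject: str, mask_token: str = "[MASK]") -> str:
    """Instantiate a relation template with the subject and mask the object."""
    template = TEMPLATES[relation]
    return template.replace("[X]", subject).replace("[Y]", mask_token)

print(make_cloze("birth-place", "Dante"))   # Dante was born in [MASK]
print(make_cloze("capital-of", "Bolivia"))  # [MASK] is the capital of Bolivia
```

In the actual task, the instantiated sentence would be fed to a masked language model (e.g., via a fill-mask interface) and the model's ranking of the vocabulary at the mask position would be scored against the gold object.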
Additional methods for choosing template sentences are proposed by Jiang et al. (2020) who, similar to Bouraoui et al. (2020), mine the training data for suitable sentences. A dependency analysis of candidate sentences makes it possible to extract a larger variety of phrases expressing the desired relationship than Bouraoui et al. (2020) can. The authors also generate candidate sentences by paraphrasing. In short, they find that both mining and paraphrasing have their uses, and that combinations of template types, e.g., manually constructed and mined, often perform well. Poerner et al. (2019) question the conclusion by Petroni et al. (2019) that BERT contains factual knowledge derived from the training data. The authors believe that in many cases, BERT simply exploits superficial similarities and general patterns to guess what is most likely. For example, from the fact that a person has a typically French surname, BERT could guess that the person is French without having learned the nationality of that particular person. To expose this weakness, Poerner et al. (2019) remove what they believe are easily guessed pairs of subjects and objects from the data set of Petroni et al. (2019). They also provide a modified version of BERT, E-BERT, in which the embeddings of entities mentioned in Wikipedia have been replaced by symbolic entity embeddings. They find that E-BERT outperforms both BERT and ERNIE on the trimmed data set, but also that a combination of E-BERT and BERT (taking the average of, or concatenating, the embeddings) gives higher accuracy than either on its own. Rosenbloom (2010) models different types of declarative and procedural memory with what are essentially weighted hypergraphs, in which nodes correspond to actions and conditions, and edges to activation functions. Procedural and declarative memory are distinguished based on the direction in which values are propagated through the hypergraph.
The analogy to human cognition is that procedural memory contains information about how to do something, whereas declarative memory concerns facts and events.

The role of semantic relations
As the brief account given in the previous section shows, there is a solid body of work on the extraction of relations from language models (see Section 2), used to derive facts such as that the birth place of Olga Tokarczuk is Sulechów, Poland, and that the capital of Bolivia is La Paz. Looking to knowledge bases, it is natural to view them as graphs, where nodes represent objects and properties, and edges represent semantic relations.

Figure 1: In this work we focus on recovering synonyms, hypernyms, hyponyms, and meronyms from natural language models via probing to understand the prerequisites of integration with knowledge bases.

Table 1: Examples of synonymy, hyponymy, and meronymy pairs found in WordNet.

Finally, for logical inference, basic semantic relations such as synonymy, hyponymy, hypernymy, and meronymy play a central role. We recall that words are synonyms if they have (nearly) the same meaning; that a hypernym of a concept is a generalisation of that concept (e.g., 'bird' is a hypernym of 'sparrow'), while a hyponym is an instance of the concept (e.g., 'spider' is a hyponym of 'arachnid'); and that a meronym of a concept is a part of the whole (e.g., 'branch' is a meronym of 'tree'); see Table 1 for examples found in WordNet (Fellbaum, 1998).
For logical inference, we can infer that starfish are not fish from knowing that 'heart' is a meronym of 'craniate' but not of 'starfish' (all craniates have hearts whereas starfish do not), 'vertebrate' is a hypernym of 'fish' (fish are vertebrates), and 'craniate' is a synonym of 'vertebrate'. See Figure 1 and Table 1 for further examples.
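The chain of reasoning above can be made concrete as a few explicit rules over relation triples. The encoding below is our own minimal illustration over a four-fact toy knowledge base, not an actual inference engine; the `lacks_part` relation is an assumed way of stating that starfish have no heart.

```python
# Minimal sketch of the starfish inference over explicit relation triples.
hypernym = {("fish", "vertebrate")}      # fish are vertebrates
synonym = {("vertebrate", "craniate")}   # 'craniate' is a synonym of 'vertebrate'
meronym = {("heart", "craniate")}        # every craniate has a heart
lacks_part = {("starfish", "heart")}     # starfish have no heart (assumed fact)

def is_a(x, y):
    """x is a kind of y, following hypernym edges and synonym equivalences."""
    if x == y or (x, y) in hypernym:
        return True
    # expand one synonym step in either direction (enough for this toy KB)
    for a, b in synonym | {(b, a) for a, b in synonym}:
        if (x, a) in hypernym and b == y:
            return True
    return False

def cannot_be(x, kind):
    """If every `kind` has a part that x lacks, then x cannot be a `kind`."""
    for part, whole in meronym:
        if is_a(kind, whole) and (x, part) in lacks_part:
            return True
    return False

print(cannot_be("starfish", "fish"))  # True: fish are craniates, which have hearts
```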
To achieve a seamless integration of a neural network with a knowledge base of relations and an inference engine, we propose to devise methods for (i) enabling the network to utilise the knowledge base, but fall back on the less certain information in the embedding when necessary and (ii) taking the relations in the knowledge base into account during network training, so that the trained network reflects the contents of the knowledge base. In this endeavour, we believe that particular emphasis should be placed on the treatment of lexicosemantic relations such as meronymy, hyponymy, and synonymy because of their central role in logical deduction and lexical semantics.
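Mechanism (i) can be sketched as a KB-first lookup: a query first consults the knowledge base of established facts and only falls back on the less certain, network-derived score when the knowledge base is silent. The scoring function and threshold below are illustrative assumptions, not a proposed implementation.

```python
# Sketch of mechanism (i): trust the KB outright, fall back on an
# embedding-based score only when the KB has no entry for the pair.
def query(pair, kb, embed_score, threshold=0.8):
    """Return (answer, source) for 'is this pair in the relation?'."""
    if pair in kb:                 # established fact: no fallback needed
        return True, "kb"
    score = embed_score(pair)      # uncertain, network-derived evidence
    return (score >= threshold), "embedding"

kb = {("sparrow", "bird")}                                  # established facts
scores = {("penguin", "bird"): 0.9, ("stone", "bird"): 0.1}  # toy network scores
print(query(("sparrow", "bird"), kb, scores.get))   # (True, 'kb')
print(query(("penguin", "bird"), kb, scores.get))   # (True, 'embedding')
```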

Empirical study: method
To gain some initial insight into how well state-of-the-art pretrained contextual embeddings handle lexico-semantic relations, we conducted experiments on word embeddings generated by ALBERT (Lan et al., 2020), RoBERTa (Liu et al., 2019), BERT (Devlin et al., 2019), and GPT-2 (Radford et al., 2019). For comparison, we also included Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) models in our experiments. These are all self-supervised learning algorithms, based on neural networks and built to translate words into vector representations. BERT and GPT-2 are transformer models, each with 12 layers. RoBERTa is a retraining of BERT on a larger data set, while ALBERT is an extension of BERT that achieves a higher data throughput with roughly 10x fewer parameters, and thus scales better.
In contrast to the works discussed in Section 2, we do not extract relations from the embeddings by means of linguistic templates. Rather, we view relation extraction as an instance of probing (Rogers et al., 2018; Conneau et al., 2018; Yaghoobzadeh et al., 2019; Hupkes et al., 2020), a diagnostic method to reveal what aspects of the input the embedding actually encodes. Probing tasks should ideally be agnostic as to the underlying encoder architecture, so that results are transferable between embeddings (Hewitt and Liang, 2019; Dahlgren et al., 2021). We also implement random control tasks (Hewitt and Liang, 2019); see the discussion in Section 5. In our experiments, we considered the following probing task: given a pair of word vectors, decide whether the encoded words are in the relation R. This avoids the optimisation problem linked to the choice of template seen in Petroni et al. (2019).
All experiments are on the English language, and the data set used in our experiments was obtained from WordNet as follows. We first built a vocabulary V by taking the 5 000 most common nouns in the Brown corpus (Kucera and Francis, 1967) and removing those not found in WordNet (Fellbaum, 1998). This resulted in a vocabulary of 3 497 words. For each word w in the vocabulary V and target relation R ∈ {hypernym, meronym, synonym}, we then picked words v and v′ in V such that (w, v) ∈ R and (w, v′) ∉ R.

Table 2: The probing accuracy on the semantic relations, with variance given in parentheses. The accuracy of a "largest class" strategy is shown next to each relation. All transformers give embeddings of 768 dimensions; word2vec and GloVe use 300 dimensions. The relations contain 1 712, 306, 2 740, and 1 630 samples, respectively.
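The pair construction can be sketched as follows, with a toy vocabulary and relation standing in for the Brown-corpus nouns and the WordNet relations.

```python
import random

# Sketch of the data construction: for each vocabulary word w and target
# relation, pick a related word v and an unrelated word v' from the same
# vocabulary. Vocabulary and relation here are illustrative stand-ins.

def build_pairs(vocab, relation, rng):
    """Yield triples (w, v, v_neg) with (w, v) in the relation and (w, v_neg) not."""
    triples = []
    for w in vocab:
        positives = [v for v in vocab if (w, v) in relation]
        negatives = [v for v in vocab if v != w and (w, v) not in relation]
        if positives and negatives:
            triples.append((w, rng.choice(positives), rng.choice(negatives)))
    return triples

vocab = ["sparrow", "bird", "tree", "branch"]
hypernym = {("sparrow", "bird")}
rng = random.Random(0)
print(build_pairs(vocab, hypernym, rng))
```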
We formulate a classification task for each relation R, and probe each of the investigated models for its ability to capture the relation in its embeddings. The classification tasks are based on 1 712, 306, 2 740, and 1 630 samples for synonyms, meronyms, hypernyms, and hyponyms, respectively. We use a linear classifier as probe, since linear probes better reflect the availability of the information probed for (Hewitt and Liang, 2019; Dahlgren et al., 2021). From each triple (w, v, v′), the positive pair (w, v) and the negative pair (w, v′) are drawn with equal probability and labeled 1 or 0, respectively. The chosen pair is given to the probe as input by concatenating the two word embeddings, together with its binary label. We train the probe for 10 epochs using 5-fold cross validation, with softmax activation, a dropout of 0.2 to prevent the probe from memorising samples, and cross-entropy loss minimised with the Adam optimizer at a learning rate of 0.001. We average the results over 5 runs. The experiment is implemented in PyTorch for CPU and uses the Hugging Face (Wolf et al., 2019) library for all pretrained transformers, and the Gensim (Rehurek and Sojka, 2011) library for word2vec and GloVe. The experiments completed within 1 hour on an Intel i7-based Linux laptop with 32 GB RAM. The code is available on GitHub (https://github.com/dali-does/semprof). Table 2 displays the numerical results, with the header row showing, for each relation R, the size of the larger of the two classes. This number coincides with the accuracy of the control tasks implemented to measure selectivity, which are therefore omitted to limit redundancy. The table shows the linear probe's classification accuracy for each language model, with the variance given in parentheses. As can be expected, the variance is highest for meronyms, where there is the least data.
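As a minimal illustration of such a linear probe, the following trains a single linear layer by gradient descent on the logistic loss over synthetic concatenated pair embeddings. The dimensions and data are stand-ins; the actual experiments use PyTorch, 768-dimensional embeddings, dropout, and cross validation.

```python
import numpy as np

# Linear probe sketch: concatenated word-embedding pairs are classified
# as related / unrelated by a single linear layer with logistic loss.
rng = np.random.default_rng(0)
dim = 16                                  # embedding dimension (768 in the paper)
n = 200
# synthetic "embeddings": related and unrelated pairs have shifted means
pos = rng.normal(0.5, 1.0, (n // 2, 2 * dim))
neg = rng.normal(-0.5, 1.0, (n // 2, 2 * dim))
X = np.vstack([pos, neg])
y = np.array([1] * (n // 2) + [0] * (n // 2))

w = np.zeros(2 * dim)
b = 0.0
lr = 0.1
for _ in range(200):                      # plain gradient descent on logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= lr * X.T @ grad / n
    b -= lr * grad.mean()

acc = (((X @ w + b) > 0) == (y == 1)).mean()
print(f"probe accuracy: {acc:.2f}")
```

A high accuracy here only means the linear probe can read the relation off the (synthetic) representation; the control tasks mentioned above guard against the probe itself doing the work.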

Results and discussion
Various observations can be made by comparing the results for the individual embeddings. Particularly worth noting is the fact that GloVe and word2vec perform on par with or better than the contextual embeddings, except in the case of hyponyms. This behaviour was seen with 5 and 20 training epochs as well.
The relatively strong performance of the pre-transformer solutions may not be surprising as far as synonyms are concerned, since their construction builds on aligning words found in the same contexts. However, we would not have expected similar results for hypernyms, and even less so for meronyms. We note that ALBERT does not accessibly encode any of the relations, resulting in random guesses. This could be because ALBERT is trained with tenfold fewer parameters to produce much smaller embeddings, and might thus have less room for this type of information. Since ALBERT is comparable in performance to, e.g., BERT on many data sets and other metrics, further investigation is needed to see to what extent these relations are present in the data sets. The complexity of the probe could also be the culprit, as an embedding of lower dimensionality poses a more difficult task for a probe with limited capability of separating intertwined concepts. These results do not mirror those of Lan et al. (2020), which indicates that the relations studied here could receive more attention in future evaluations of language embeddings. RoBERTa seems to generally outperform the other transformers, especially on hyponyms, taking into account that not all results are statistically significant. Hypo-/hypernym relations usually follow a tree hierarchy, with hypernyms directed towards the root. This gives a decreasing number of hypernyms; for example, fish has six hypernyms but 39 hyponyms in WordNet, and it is likely that less common words will be chosen as positive examples for hyponyms. Weighting the words according to frequency could show different results, but filtering words based on the data the models are trained on would be counterproductive to the purpose of these probes.
That RoBERTa is better able to capture hyponyms could be an effect of the much larger data set used in its training compared to the other BERT models, leading to more of the less common hyponym examples being seen. One hypothesis for why GPT-2 also shows poor performance is that Wikipedia is removed from its training data: many Wikipedia articles explicitly outline hyponym relations, e.g., "The cat is a [domestic species of small carnivorous] mammal".
Summarising the results, the fact remains that, according to our probes, no model covers the relations reliably. If this observation is confirmed by further experiments, it supports the case for a combination of neural networks, traditional relational knowledge bases, and inference engines. With such an architecture, established facts could be retrieved from the knowledge base and complemented by less certain facts deduced by the network, compensating for missing information without causing inconsistencies. The results also indicate that a strict threshold should be applied when transferring relational knowledge derived from an embedding to a knowledge base, if this should be done at all, to avoid large error propagation. This is especially important if the "facts" in the knowledge base are considered to be absolute truths rather than tentative findings.
In conclusion, the reliability of the probe could be improved with evaluation sets drawn from relations found in knowledge bases, and a correlational study between probing accuracy and downstream NLP tasks could further support the usefulness of studying these relations.