Inspecting the concept knowledge graph encoded by modern language models

The field of natural language understanding has experienced exponential progress in recent years, with impressive results in several tasks. This success has motivated researchers to study the underlying knowledge encoded by these models. Despite this, attempts to understand their semantic capabilities have not been successful, often leading to inconclusive or contradictory findings across works. Via a probing classifier, we extract the underlying knowledge graph of nine of the most influential language models of recent years, including word embeddings, text generators, and context encoders. This probe is based on concept relatedness, grounded in WordNet. Our results reveal that all the models encode this knowledge, but suffer from several inaccuracies. Furthermore, we show that the different architectures and training strategies lead to different model biases. We conduct a systematic evaluation to discover specific factors that explain why some concepts are challenging. We hope our insights will motivate the development of models that capture concepts more precisely.


Introduction
Natural language processing (NLP) encompasses a wide variety of applications such as summarization (Kovaleva et al., 2019), information retrieval (Zhan et al., 2020), and machine translation (Tang et al., 2018), among others. Currently, the use of pre-trained language models has become the de facto starting point to tackle most of these tasks. The usual pipeline consists of fine-tuning a pre-trained language model with a discriminative learning objective to adapt the model to the requirements of each task. As key ingredients, these models are pre-trained on massive amounts of unlabeled data, often millions of documents, and can comprise billions of parameters. Massive data and parameters are supplemented with a suitable learning architecture, resulting in a highly powerful but also complex model whose internal operation is hard to analyze.
The success of pre-trained language models has driven the interest to understand the mechanisms they use to solve NLP tasks. As an example, in the case of BERT (Devlin et al., 2019), one of the most popular pre-trained models based on the Transformer (Vaswani et al., 2017), several studies have attempted to access the knowledge encoded in its layers and attention heads (Tenney et al., 2019b; Devlin et al., 2019; Hewitt and Manning, 2019). In particular, Jawahar et al. (2019) show that BERT can solve tasks at a syntactic level by using Transformer blocks to encode a soft hierarchy of features at different levels of abstraction. Similarly, Hewitt and Manning (2019) show that BERT is capable of encoding structural information from text: using a structural probe, they show that syntax trees are embedded in a linear transformation of BERT's encodings.
In general, previous efforts have provided strong evidence indicating that current pre-trained language models encode complex syntactic rules. However, relevant evidence about their abilities to capture semantic information remains elusive. As an example, Si et al. (2019) attempt to locate the encoding of semantic information in the top layers of Transformer architectures, finding contradictory evidence. Similarly, Kovaleva et al. (2019) focus on studying the knowledge encoded by self-attention weights. Their results provide evidence of over-parameterization but not of language understanding capabilities.
In this work, we study to what extent pre-trained language models encode semantic information. As a key source of semantic knowledge, we analyze their ability to encode the concept relations embedded in the conceptual taxonomy of WordNet (Miller, 1995). Understanding, organizing, and correctly using concepts is one of the most remarkable capabilities of human intelligence (Lake et al., 2017). Therefore, quantifying the ability of a pre-trained language model to encode the conceptual organization behind WordNet is highly valuable. This knowledge may provide useful insights into the inner mechanisms that these models use to encode semantic information. Furthermore, identifying what they find difficult can provide relevant insights into how to improve them.
Unlike most previous works, we do not focus on a particular model but target a large list of the most popular pre-trained language models. In this sense, one of our goals is to provide a comparative analysis of the benefits of different approaches. Following Hewitt and Manning (2019), we study semantic performance by defining a probing classifier based on concept relatedness according to WordNet. Using this tool, we analyze the different models, enlightening how and where semantic knowledge is encoded. Furthermore, we explore how these models encode suitable information to recreate the structure of WordNet. Among our main results, we show that the different pre-training strategies and architectures lead to different model biases.
In particular, we show that contextualized word embeddings, such as BERT, encode high-level concepts and hierarchical relationships among them, creating a taxonomy. This finding corroborates previous results (Reif et al., 2019) claiming that BERT vectors contain sub-spaces that correspond to semantic knowledge. Our study also provides evidence about the limitations of current pre-trained language models, demonstrating that they have difficulties encoding specific concepts. For example, all the models struggle with concepts related to "taxonomic groups". Our results also reveal that models have distinctive patterns regarding where in the architecture they encode semantic information. These patterns are dependent on the architecture and not on model size.

Study methodology
Probing methods consist of using the representation of a frozen pre-trained model to address a particular task. If the probing classifier succeeds in this setting but fails when fed with an alternative representation, the source model encodes the knowledge needed to solve the task. Furthermore, the classifier's performance can be used to measure how well the model captures this knowledge (Conneau et al., 2018). We use a probing method at the semantic level, applying it to the nine models presented in Section 2.2. Our study sheds light on whether the models encode relevant knowledge to predict concept relatedness in WordNet.
To study how accurately the models encode semantic information, we measure the correctness of predicted relations among concepts at two levels: (a) at the pair-wise level, by studying performance across sampled pairs of related or unrelated concepts, and (b) at the graph level, by using pair-wise predictions to reconstruct the actual graph. We describe both approaches in Sections 2.3 and 2.4, respectively.

WordNet splits and sampling
We partitioned the available WordNet synsets into 70/15/15 training, validation, and test sets, respectively. Our experimental setup ensures no overlap in concepts among these sets. As an example, if the concept related to "house" falls in the training set, then all its lemmas (e.g., "home", "residence") are assigned to this partition, and neither the concept nor those lemmas appear in the validation or test sets. Our sampling setup also balances, whenever possible, the number of times each concept acts as a hypernym or as a hyponym in a relation. Thus, the benefit of learning whether a word is a "prototypical hypernym", as pointed out by Levy et al. (2015), is close to zero. Further details are available in Appendix A.1.
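The lemma-disjoint split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: synsets that share any lemma are grouped into one component, and whole components are assigned to a single split. The helper name `split_synsets` and the toy synset ids in the test are ours.

```python
import random

def split_synsets(synset_lemmas, ratios=(0.70, 0.15, 0.15), seed=13):
    """Partition synsets into train/dev/test so that no lemma of a training
    synset appears in dev or test. `synset_lemmas` maps synset id -> set of
    lemma strings. Synsets sharing a lemma always land in the same split."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Union synsets with their lemmas: one connected component per "concept".
    for syn, lemmas in synset_lemmas.items():
        for lem in lemmas:
            parent[find(("syn", syn))] = find(("lem", lem))

    groups = {}
    for syn in synset_lemmas:
        groups.setdefault(find(("syn", syn)), []).append(syn)
    components = sorted(groups.values())
    random.Random(seed).shuffle(components)

    # Fill each split with whole components until its quota is reached.
    splits, total, i = ([], [], []), len(synset_lemmas), 0
    for comp in components:
        if i < 2 and len(splits[i]) >= ratios[i] * total:
            i += 1
        splits[i].extend(comp)
    return splits  # train, dev, test synset id lists
```

Because splits are built from whole lemma-connected components, lemma disjointness holds by construction regardless of the ratios.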

Word embedding models
This study considers the most influential language models from recent years, covering the essential approaches of three model families: non-contextualized word embeddings (NCE), contextualized word embeddings (CE), and generative language models (GLM). For the first family, we consider Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). For the CE family, we consider ELMo (Peters et al., 2018b), implemented on a bidirectional LSTM architecture, as well as XLNet (Yang et al., 2019), BERT (Devlin et al., 2019), and BERT's extensions ALBERT (Lan et al., 2020) and RoBERTa (Liu et al., 2019), all based on the Transformer architecture. GPT-2 (Radford et al., 2018) and T5 (Raffel et al., 2019) are included in the study to incorporate approaches based on generative language models.

Figure 1: Inputs to the edge probing classifier correspond to the model embeddings M(x) and M(y) of concepts x and y, respectively. M(x) and M(y) are projected into a common lower-dimensional space using a linear layer. The resulting embeddings x′ and y′ are concatenated and fed into a Multi-Layer Perceptron that predicts whether the concept pair is related.
For models in the CE and GLM families, the embedding is extracted after running the model on a sentence where the concept is used in context. We then discard the context and keep only the first token corresponding to the specific mention of the concept. Finally, we concatenate the hidden states of every layer of the model for the selected token.
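The extraction step above amounts to indexing one token and concatenating layers. The sketch below uses plain numpy arrays as stand-ins for the per-layer activations a Transformer returns (e.g., with `output_hidden_states=True` in common implementations); the helper name and shapes are illustrative assumptions.

```python
import numpy as np

def concept_embedding(hidden_states, mention_span):
    """Build a concept vector from per-layer hidden states.

    hidden_states: list of arrays, one per layer, each (seq_len, hidden_dim);
                   stand-ins for the activations of a model like BERT.
    mention_span:  (start, end) token indices of the concept mention; only
                   the FIRST token is kept, as described in the text.
    Returns a vector of size num_layers * hidden_dim.
    """
    first_tok = mention_span[0]
    return np.concatenate([layer[first_tok] for layer in hidden_states])
```

For a 12-layer model plus its embedding layer and hidden size 768, this yields a 13 × 768 = 9,984-dimensional concept vector.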

Semantic probing classifier
We define an edge probing classifier that learns to identify whether two concepts are semantically related. To create the probing classifier, we retrieve all the glosses from the Princeton WordNet Gloss Corpus. This dataset provides WordNet synset gloss sentences with annotations identifying occurrences of concepts within different sentence contexts; the annotations map the words used to their corresponding WordNet nodes. We sample hypernym pairs ⟨A, B⟩. Then, from an unrelated section of the taxonomy, we randomly sample a third synset C, taking care that C is related to neither A nor B. The triplet A, B, C allows us to create six testing edges for our classifier. To train the probing classifier, we define a labeled edge {x, y, L}, with x and y synsets in {A, B, C}, x ≠ y, and L ∈ {0, 1} the target of the edge: L = 1 if y is a direct or indirect parent of x, and L = 0 otherwise. For each synset x, y, we sample one of its sentences S(x), S(y) from the dataset. Let M be a model. If M belongs to the NCE family, x and y are encoded by M(x) and M(y), respectively. If M belongs to the CE or GLM families, then x and y are encoded by the corresponding tokens of M(S(x)) and M(S(y)), respectively.
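Concretely, one triplet yields one positive and five negative directed edges. A minimal sketch (the helper name is ours):

```python
def triplet_edges(A, B, C):
    """Expand a triplet (A: hyponym, B: hypernym of A, C: unrelated) into
    the six directed, labeled probing edges described in the text.
    Label 1 means "the second concept is a direct or transitive hypernym
    of the first"; direction matters, so (B, A) is negative."""
    return [
        (A, B, 1),            # the only valid hypernym edge
        (B, A, 0),            # reversed pair is unrelated
        (A, C, 0), (C, A, 0),
        (B, C, 0), (C, B, 0),
    ]
```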
To facilitate the evaluation of embeddings of different sizes, we first project each concept's encodings x and y into a low-dimensional space using a linear layer (see Figure 1). These vectors, denoted x′ and y′, are concatenated and fed into a Multi-Layer Perceptron (MLP) classifier. The linear layer and the MLP are the only trainable parameters in our setting, as we use the source model weights without any fine-tuning. Throughout all the experiments we used an MLP classifier with a single hidden layer of 384 units.
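A forward pass of this probe can be sketched in a few lines. Only the 384-unit hidden layer comes from the text; the projection width, initialization scale, and class name below are illustrative assumptions, and training code is omitted.

```python
import numpy as np

class EdgeProbe:
    """Numpy sketch of the probe: a shared linear projection into a
    low-dimensional space, concatenation, and an MLP with one hidden
    layer of 384 units."""

    def __init__(self, emb_dim, proj_dim=256, hidden=384, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(0.0, 0.02, (emb_dim, proj_dim))   # shared projection
        self.W1 = rng.normal(0.0, 0.02, (2 * proj_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.02, (hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, mx, my):
        """mx, my: the concept embeddings M(x), M(y)."""
        xp, yp = mx @ self.P, my @ self.P                     # x', y'
        h = np.maximum(np.concatenate([xp, yp]) @ self.W1 + self.b1, 0.0)
        logit = (h @ self.W2 + self.b2)[0]
        return 1.0 / (1.0 + np.exp(-logit))                   # P(related)
```

The shared projection is what lets embeddings of very different dimensionality (e.g., 300-d GloVe vs. multi-thousand-d layer concatenations) feed the same classifier.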
We use this MLP to learn the structural relation between concept pairs, providing the test with a mechanism that allows the embeddings to be combined in a non-linear way. Tests based on linear transformations such as the one proposed by Hewitt and Manning (2019) did not allow us to recover the WordNet structure. This indicates that the sub-spaces where the language models encode semantics are not linear. The fact that syntactic information is linearly available suggests that syntax trees might be a critical intermediate result for the language modeling task. In contrast, semantic information emerges as an indirect consequence of accurate language modeling. Still, it might not constitute information that the model relies on for NLP tasks, as postulated by Ravichander et al. (2020).
To discard the possibility that the MLP memorizes properties of words, thus giving undeserved credit to the analyzed models, we generated alternative training and validation sets with random word embeddings of the same size as the real ones. During training and inference, these vectors were kept frozen. These tests yielded around 50% accuracy on the binary classification task, indicating that the MLP cannot do better than chance in that scenario. Thus, if in a later experiment the same MLP succeeds at the task, the merit can be attributed to the input embeddings themselves. This result is consistent with the fact that our experimental setup ensures no overlap in concepts among training, development, and testing sets.
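This control can be reproduced with a frozen random lookup table of the same dimensionality as the real embeddings. The function name and seed below are our own illustrative choices.

```python
import numpy as np

def random_control_embeddings(vocab, dim, seed=7):
    """Frozen random-vector lookup used as a control: same dimensionality
    as the real embeddings, fixed per word, never updated. If the probe
    scores ~50% with these, later success must come from the real
    embeddings rather than from the MLP memorizing words."""
    rng = np.random.default_rng(seed)
    return {w: rng.standard_normal(dim) for w in sorted(vocab)}
```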

Reconstructing the structure of a knowledge graph
The probe classifier predicts whether a pair of concepts u, v forms a valid ⟨parent, child⟩ relation according to WordNet, where h_{u,v} ∈ [0, 1] denotes the corresponding classifier output. It is important to note that valid ⟨parent, child⟩ relations include direct relations (e.g., ⟨dog, poodle⟩) and transitive relations (e.g., ⟨animal, poodle⟩), and that the order of the items matters.
To reconstruct the underlying knowledge graph, for each valid ⟨parent, child⟩ relation given by h_{u,v} > threshold, we need an estimation of how close the nodes are in the graph. We do this by introducing the concept of "parent closeness" between a parent node u and a child node v, denoted d_e(u, v). We propose two alternative scores to estimate d_e: i) Model Confidence Metric (MCM): All the models considered in this study capture close relations more precisely than distant relations (supporting evidence can be found in Appendix D). This means that a concept like poodle will be matched with its direct parent node dog with higher confidence than with a more distant parent node (e.g., animal). Thus, we can define d_e(u, v) = 1 − h_{u,v}.
ii) Transitive Intersections Metric (TIM): We explore a metric grounded directly in the tree structure of a knowledge graph. Note that nodes u and v that form a parent-child relation have transitive connections in common: all descendants of v are also descendants of u, and all ancestors of u are also ancestors of v. Hence, the closer the link between u and v in the graph, the bigger the intersection. Accordingly, for each edge e = ⟨u, v⟩, we define d_e(u, v) as:

d_e(u, v) = (1 − |desc(u) ∩ desc(v)| / |N|) + (1 − |anc(u) ∩ anc(v)| / |N|) − h_{u,v}

where the first term of the sum accounts for the similarity within the descendants of nodes u and v, the second term accounts for the similarity within the ancestors of nodes u and v, the term h_{u,v} on the right-hand side accounts for the edge direction, and N denotes the set of nodes (concepts).
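Under the stated assumptions (descendant and ancestor sets computed by transitive closure, and the classifier output h_{u,v} accounting for direction), the TIM score can be sketched as below. The exact functional form of the paper's equation may differ; this version follows only the surrounding prose, so treat the formula line as an assumption.

```python
def transitive_closure(adj, v):
    """All nodes reachable from v via `adj` (parent->children gives
    descendants; child->parents gives ancestors)."""
    seen, stack = set(), [v]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def tim_distance(u, v, children, parents, n_nodes, h_uv):
    """Parent closeness of directed edge u -> v: bigger descendant and
    ancestor overlaps (and a confident direction score) mean a smaller
    distance."""
    desc = transitive_closure(children, u) & transitive_closure(children, v)
    anc = transitive_closure(parents, u) & transitive_closure(parents, v)
    return (1 - len(desc) / n_nodes) + (1 - len(anc) / n_nodes) - h_uv
```

On a toy chain entity → animal → dog → poodle, the direct pair ⟨dog, poodle⟩ gets a smaller distance than the distant pair ⟨entity, poodle⟩ at equal classifier confidence, matching the intuition in the text.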
A strategy to find a tree that comprises each node's closest parents is the minimum-spanning-arborescence (MSA) of the graph defined using d_e. The MSA is analogous to the minimum-spanning-tree (MST) objective used by Hewitt and Manning (2019), but for directed graphs. The formulation of the MSA optimization problem applied to our setting is provided in Appendix A.3.
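For intuition, the MSA objective can be checked by brute force on tiny graphs: enumerate every root and every parent assignment, and keep the cheapest assignment in which all nodes reach the root. This is a didactic stand-in of our own, exponential in the number of nodes, for the Chu-Liu/Edmonds algorithm one would actually run on the paper's 46-node sub-graph.

```python
from itertools import product

def min_spanning_arborescence(nodes, dist):
    """Brute-force MSA for SMALL graphs. `dist[(u, v)]` is the parent
    closeness d_e(u, v) of the directed edge u -> v. Returns the total
    cost and the chosen (parent, child) edges."""
    best_cost, best_edges = float("inf"), None
    for root in nodes:
        others = [v for v in nodes if v != root]
        for parents in product(nodes, repeat=len(others)):
            assign = dict(zip(others, parents))
            if any(p == v for v, p in assign.items()):
                continue  # no self-loops
            ok = True
            for v in others:  # every node must reach the root (no cycles)
                seen, cur = set(), v
                while cur != root:
                    if cur in seen:
                        ok = False
                        break
                    seen.add(cur)
                    cur = assign[cur]
                if not ok:
                    break
            if ok:
                cost = sum(dist[(p, v)] for v, p in assign.items())
                if cost < best_cost:
                    best_cost = cost
                    best_edges = sorted((p, v) for v, p in assign.items())
    return best_cost, best_edges
```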

Semantic edge probing classifier results
When single layers are used, GLM models show improved performance, suggesting that these models capture semantics earlier in the architecture, reserving their last layers for generation-specific purposes. In contrast, CE models degrade or maintain their performance when single layers are used. Note that Table 2 reports pair-wise metrics, not graph metrics. Since we are dealing with graphs, predicted edges build upon related edges, so drifts in small regions of the graph may cause large drifts in downstream connections. Furthermore, our setup balances positive and negative samples, while the proportion of negative samples can be considerably larger in a real reconstruction scenario. As a consequence, we emphasize that these numbers must be considered together with the results reported in Sections 3.2 and 4.

Extracting the Knowledge Graph
Predicting a knowledge graph has a complexity of at least O(N²) in the number of analyzed concepts. In our case, this imposes a highly demanding computational obstacle because WordNet has over 82,000 noun synsets. To accelerate experimentation and facilitate our analysis and visualizations, we focus on extracting a WordNet sub-graph comprising 46 nodes not seen during training or validation. These nodes are picked to include easily recognizable relations. We use the tree edit distance to evaluate how close the reconstructed graphs are to the target graph extracted from WordNet. Table 1 shows our results: graphs retrieved using CE models are closer to the target than graphs provided by NCE and GLM models. In particular, the best results are achieved by BERT, ALBERT, and XLNet, indicating that these models encode more accurate semantic information than the alternative models. These results are consistent with those obtained in Section 3.1. The graphs for all the models can be found in Appendix C.

Figure 3-b supports this intuition, as concepts with a higher number of sub-classes have higher F1-scores. Figure 3-c shows that models degrade their F1-score when concepts are too frequent. In particular, NCE and GLM models are more sensitive to this factor.
Another finding is that CE and GLM models are almost unaffected by the number of senses a word has, by their sense ranking, or by their number of sibling concepts, displaying almost flat charts (see Figures 3-d and 3-e). This result suggests that these models pay more attention to the context than to the target word. This behavior is the opposite of what NCE models exhibit according to Yaghoobzadeh et al. (2019), as NCE models tend to focus on frequent senses.
In most cases, models of the same family have similar behaviors, especially within the NCE and CE families, while different families show different patterns. Table 3 shows some salient examples. Surprisingly, all models struggle with the category "taxonomic groups". Manual inspection of sentences leads us to believe that the context confuses CE and GLM models in these cases: in many sentences, the corresponding concept could be replaced by another, conveying a modified but still valid message. This phenomenon does not occur in other categories such as "social group" or "attribute", even though these concepts are closely related to "taxonomic groups".

Where is this knowledge located?
As mentioned in Section 7, prior work has not reached consensus about where semantic information is encoded inside these architectures. Our experiments shed light on this subject. Figure 4 shows how each layer contributes to the F1-score.
Figures 4-a and 4-b show the performance across layers for the CE-based models. They reveal that while BERT and RoBERTa use their top layers to encode semantic information, XLNet and ALBERT use the first layers. Figure 4-c shows that while GPT-2 uses all its layers to encode semantics, T5 shows an M shape related to its encoder-decoder architecture, with the encoder holding most of the semantic information. We also note that small models show patterns similar to their larger counterparts.

For contextual models such as CE, we hypothesize that stronger inductive biases are required to capture low-frequency concepts. Furthermore, we believe that new learning approaches are needed to discriminate accurate meanings for high-frequency concepts. As expected, our findings indicate that model families have different biases leading to different behaviors. Thus, our results can illuminate further research to improve semantic capabilities by combining the strengths of each family of models. For example, one could combine them as ensembles, each equipped with a different loss function (i.e., one generative approach resembling GLM-based methods and another discriminative approach resembling CE-based methods).

Further discussion and implications
Findings (4), (5), and (6) suggest that, instead of standard fine-tuning of all layers of BERT for a given downstream task, one could perform task profiling to decide the best architecture for the task and how to take advantage of it. By using only a limited number of layers or choosing a different learning rate per layer, one could exploit the semantic knowledge the pre-trained model carries, avoiding the degradation of this information at the top layers, especially when using T5, XLNet, or ALBERT-large. Accordingly, recent work on adaptive strategies that output predictions using a limited number of layers (Xin et al., 2020; Liu et al., 2020; Hou et al., 2020; Schwartz et al., 2020; Fan et al., 2020; Bapna et al., 2020) would benefit from architectures that encode knowledge in the first layers. To the best of our knowledge, these works have only used BERT and RoBERTa, achieving a good trade-off between accuracy and efficiency. Only Zhou et al. (2020) has explored ALBERT, reporting improved accuracy by stopping earlier. Our findings explain this behavior and suggest that T5 or XLNet may boost these results even further, as these architectures have sharper and higher information peaks in their first layers. Findings (7) and (8) suggest that recent success in semantic NLP tasks might be due more to the use of larger models than to larger pre-training corpora. This also suggests that, to improve model performance on semantic tasks, one could train larger models even without increasing the corpus size. A similar claim has been made by Li et al. (2020), leading to empirical performance improvements.
Finally, finding (9) is important because it suggests that contextual models pay as much attention to the context as to the target word.

Summary of findings, supporting evidence, and involved models:
(1) All models encode a relevant amount of knowledge about semantic relations in WordNet, but this knowledge contains imprecisions. Evidence: all experiments. Models: all.
(2) The ability to learn concept relations depends on how frequent and specific the concepts are. Some model families are more affected.
(3) Concept difficulty is usually homogeneous within each model family. Some semantic categories challenge all models. Evidence: Table 3. Models: all.
(4) Some models encode stronger semantic knowledge than others, usually according to their family. Evidence: Tables 1, 2, and 3. Models: ELMo, BERT, RoBERTa, ALBERT, XLNet, T5.
(5) Some models focus their encoding of semantic knowledge in specific layers, rather than distributing it across all layers.
(7) Model size has an impact on the quality of the captured semantic knowledge, as seen in our layer-level probe tests.

Related work

Semantic capabilities of language models: While previous studies report favorable results in probing tasks (Tenney et al., 2019a), semantic roles (Rogers et al., 2020), and sentence completion (Ettinger, 2020), other studies show less favorable results in coreference (Tenney et al., 2019b), multiple-choice reading comprehension (Si et al., 2019), and lexical relation inference (Levy et al., 2015), claiming that BERT's performance may not reflect the model's true ability of language understanding and reasoning. Tenney et al. (2019b) propose a set of edge probing tasks to test the sentential structure encoded by contextualized word embeddings. The study shows evidence that the improvements that BERT and GPT-2 offer over non-contextualized embeddings such as GloVe are only significant in syntactic-level tasks. Regarding static word embeddings, Yaghoobzadeh et al. (2019) show that senses are well represented in single-vector embeddings if they are frequent, and that this does not harm NLP tasks whose performance depends on frequent senses.
Layer-wise or head-wise information: Tenney et al. (2019a) show that the first layers of BERT focus on encoding short dependency relationships at the syntactic level (e.g., subject-verb agreement), while top layers focus on encoding long-range dependencies (e.g., subject-object dependencies). Peters et al. (2018a) support similar conclusions for convolutional, LSTM, and self-attention architectures. While these studies also suggest that the top layers encode semantic information, the evidence supporting this claim is inconclusive or contradicts other works. For example, Jawahar et al. (2019) could only identify one SentEval semantic task that topped at the last layer. In terms of information flow, Voita et al. (2019a) report that information about the past in left-to-right language models vanishes as information flows from the bottom to the top layers of BERT. Hao et al. (2019) show that the lower layers of BERT change less during fine-tuning, suggesting that layers close to the inputs learn more transferable language representations. Press et al. (2020) show that increasing self-attention at the bottom layers improves language modeling performance with BERT. Other studies focus on understanding how self-attention heads contribute to solving specific tasks (Vig, 2019). Kovaleva et al. (2019) show a set of attention patterns repeated across different heads when solving GLUE tasks. Furthermore, Michel et al. (2019) and Voita et al. (2019b) show that several heads can be removed without harming downstream tasks.

Automated extraction of concept relations:
Although the main focus of our work is not to master the probing task of extracting knowledge from WordNet, but to use it as an instrument to verify and compare the abilities of current families of language models to encode this kind of knowledge, for completeness we briefly mention previous literature on this subject. Relation extraction is an active research topic. Early works are either feature-based, usually relying on SVMs or Maximum Entropy, or based on a set of manually defined rules (Hearst, 1998; Kambhatla, 2004; Dashtipour et al., 2017; Minard et al., 2011; Weeds et al., 2014; Chen et al., 2015). Other methods rely on manually defined distance metrics to estimate the relatedness of two semantic instances (Dandan et al.).

Alternative approaches: Several alternative approaches have been used in previous works. Some are dataset-focused (Miller et al., 1994; Levy et al., 2015; Wiedemann et al., 2019), usually relying on annotated corpora that challenge semantic abilities. These approaches have provided useful insights, but usually suffer from low availability of data, as they cover only a small fraction of the WordNet ontology. As an example, BLESS (

Conclusions
In this work, we exploit the semantic conceptual taxonomy behind WordNet to test the ability of current families of pre-trained language models to learn semantic knowledge from massive sources of unlabeled data. Our main conclusion is that, indeed, to a significant extent, these models learn relevant knowledge about the organization of concepts in WordNet, but also contain several imprecisions. We also notice that different families of models present dissimilar behavior, suggesting the encoding of different biases.
We hope our study helps to inspire new ideas to improve the semantic learning abilities of current pre-trained language models.

A Implementation details

A.1 Edge probing classifier details
To study the extent to which these Language Models deal with semantic knowledge, we extend the methodology introduced by Tenney et al. (2019b).
In that study, the authors defined a probing classifier at the sentence level, training a supervised classifier with a task-specific label. The motivation behind the probing classifier is to verify whether a sentence's encoding helps to solve a specific task, quantifying the results for different word embedding models. We adapt this methodology to semantic knowledge extracted from WordNet: rather than working at the sentence level, we define an edge probing classifier that learns to identify whether two concepts are semantically related.
To create the probing classifier, we retrieve all the glosses from the Princeton WordNet Gloss Corpus. The dataset provides WordNet synset glosses with manually matched words identifying the context-appropriate sense.
As a reference of size, the selected annotations in the corpus account for 41,502 lemmas, corresponding to 34,371 WordNet synsets. This resulted in 230,215 valid WordNet relations.
In WordNet, each sense is coded as one of the synsets related to the concept (e.g., sense tendency.n.03 for the word tendency). Using a synset A and its specific sense provided by the tagged gloss, we retrieve from WordNet one of its direct or indirect hypernyms, denoted B (see Figure 5). If WordNet defines two or more hypernyms for A, we choose one of them at random. We sample a third synset C at random from an unrelated section of the taxonomy, taking care that C is related to neither A nor B (e.g., animal.n.01). The triplet A, B, C allows us to create six testing edges for our classifier: ⟨A, B⟩, a pair of words related through the hypernym-of relation, and five pairs of unrelated words (⟨A, C⟩, ⟨B, C⟩, ⟨B, A⟩, ⟨C, A⟩, ⟨C, B⟩). We associate a label to each of these pairs indicating whether the pair is related or not (see Figure 5). Note that we define directed edges, meaning that the pair ⟨A, B⟩ is related, but ⟨B, A⟩ is unrelated under the hypernym-of relation. Accordingly, the edge probing classifier needs to identify the pair's components and the order in which the concepts were declared in the pair.
We create training and testing partitions ensuring that each partition has the same proportion of leaves versus internal nodes; the latter is essential to identify related pairs. During training, we guarantee that each training synset is seen at least once by the probing classifier: we sample each synset in the training set and sample some of its hypernyms at random. Then, for each related pair, we randomly sample an unrelated synset that has no relation to either word in the pair. We split this data 70/15/15 into training, development, and test folds, respectively.
We train the MLP classifier using a weighted binary cross-entropy loss function. Since we have one positive and five negative examples per triplet, we use weights 5 and 1 for the positive and negative classes, respectively, so that positive and negative examples have the same total relevance during training. We implemented the linear layer and the MLP classifier as a feed-forward network with 384 hidden units. The MLP was trained with dropout at 0.425 and an L2 regularizer to avoid overfitting.
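The 5:1 weighting can be written directly as a weighted binary cross-entropy. This is a sketch; the clipping constant is our own choice for numerical safety.

```python
import numpy as np

def weighted_bce(p, y, w_pos=5.0, w_neg=1.0):
    """Weighted binary cross-entropy with the 5:1 class weights from the
    text (one positive vs. five negatives per triplet), so both classes
    contribute equally to the loss on average."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    w = np.where(y == 1, w_pos, w_neg)
    return float(np.mean(-w * (y * np.log(p) + (1 - y) * np.log(1 - p))))
```

With uninformative predictions p = 0.5 on one positive and five negatives, the weights (5, 1, 1, 1, 1, 1) make each class contribute half of the total loss.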
To create the vector representations for each word embedding model considered in this study, we concatenate the hidden state vectors of all the layers for each tagged synset. For both CE- and GLM-based models, each gloss was used as context to build the specific contextual word embeddings. If the gloss has more than one tagged token, we take only the first for the analysis.

A.2 WordNet metrics: distance
Let us say we have "Case-1" if y is an ancestor of x, and "Case-2" otherwise. Let d_W(x, y) be the WordNet distance between two synsets x, y, defined by:

d_W(x, y) = d_path(x, y)                    in Case-1
d_W(x, y) = d_path(x, z) + d_path(z, y)     in Case-2

where d_path(x, y) is the length of the shortest path between x and y in WordNet, measured in number of hops, and z is the closest common ancestor of x and y in the case that y is not an ancestor of x.

A.3 Minimum-Spanning-Arborescence optimization problem
Given a graph G with nodes N and unknown edges E, we define an auxiliary graph G′ with nodes N and edges E′, comprising all possible directed edges. For each edge e ∈ E′, we obtain a prediction h_e that estimates the probability of that edge representing a valid hypernymy relation, and a distance d_e that estimates the "parent closeness" 3 between the nodes in G.

Figure 5: Each triplet is used to create related and unrelated pairs of words according to the relationship hypernym of. We create six edge probing pairs, and therefore, the edge probing classifier will need to identify the pair's components and the order in which the words were declared in the pair.
We define δ(v) to be the set of edges {⟨u, v⟩ : u ∈ N, u ≠ v}, where edge ⟨u, v⟩ represents a ⟨parent, child⟩ relation. We also define γ(S) to be the set of edges {⟨u, v⟩ ∈ E′ : u ∉ S, v ∈ S}. We estimate the graph topology of G defined by E ⊂ E′ by solving the following optimization problem:

min_{r ∈ N} MSA(r)    (3)

where, for each candidate root r,

MSA(r) = min_x Σ_{e ∈ E′} d_e x_e
s.t.  Σ_{e ∈ δ(v)} x_e = 1,  ∀ v ∈ N \ {r}
      Σ_{e ∈ γ(S)} x_e ≥ 1,  ∀ S ⊆ N \ {r}, S ≠ ∅
      x_e ∈ {0, 1},          ∀ e ∈ E′    (5)

Objective function (3) is used to find the best root node r, and the nested optimization problem (5) is the minimum spanning arborescence problem applied to the dense graph G′. The final binary values of x_e estimate E by indicating whether each possible edge e exists in the graph. To solve this optimization problem, we need estimates of h_e and d_e for each edge e. We use the output of the probing classifier as an estimate of h_e, and use the TIM and MCM scores as estimates of d_e (see Section 2.4).

3 The value of this distance will be small if the hypernym relation is close, or large if it is distant or not valid.

Relative depth in the WordNet graph: (Figure 3-a). For each synset, we compared F1 with a depth score (0% for the root and 100% for leaves), measuring differences between higher- and lower-level concepts.
Concept frequency: In Figure 3-c we evaluate whether frequent concepts are easier or harder for these models to capture. The frequency was computed by counting occurrences in the 38 GB OpenWebText Corpus (http://Skylion007.github.io/OpenWebTextCorpus).
Number of Senses and Sense Ranking: (Figures 3-d and 3-e). We studied whether models are impacted by multi-sense concepts such as "period", and by their sense ranking (how frequent or rare those senses are). Surprisingly, contextualized models, and especially CE models, show no significant impact from this factor, suggesting that they are very effective at deducing the correct sense based on context. These charts also suggest that these models may weigh the context even more than the words themselves. This is intuitive for masked language models such as BERT, but not for others, such as GPT-2. Non-contextualized models are impacted by this factor, as expected.

Figure 12: Graph distance between concepts. We measured the impact of the number of "hops" separating two tested concepts on the pair-wise F1-score. The chart reveals a strong correlation across all the models in this aspect: close relations such as ⟨chihuahua, dog⟩ are, in general, considerably easier to capture than distant relations such as ⟨chihuahua, entity⟩.

Table 6: Each value represents the mean F1-score and standard deviation of all the concepts that belong to each analyzed category. Only the larger version of each model is reported. This is not an exhaustive list and categories are somewhat imbalanced. Categories were selected based on the number of sub-categories they contained.