Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries

Pretrained language models have been suggested as a possible alternative or complement to structured knowledge bases. However, this emerging LM-as-KB paradigm has so far only been considered in a very limited setting, which only allows handling 21k entities whose name is found in common LM vocabularies. Furthermore, a major benefit of this paradigm, i.e., querying the KB using natural language paraphrases, is underexplored. Here we formulate two basic requirements for treating LMs as KBs: (i) the ability to store a large number of facts involving a large number of entities and (ii) the ability to query stored facts. We explore three entity representations that allow LMs to handle millions of entities and present a detailed case study on paraphrased querying of facts stored in LMs, thereby providing a proof-of-concept that language models can indeed serve as knowledge bases.


Introduction
Language models (LMs) appear to memorize world knowledge facts during training. For example, BERT (Devlin et al., 2019) correctly answers the query "Paris is the capital of [MASK]" with "France". This observation prompted Petroni et al. (2019) to ask if LMs can serve as an alternative or complement to structured knowledge bases (KBs), thereby introducing the idea of treating LMs as KBs: During training, the LM encounters world knowledge facts expressed in its training data, some of which are stored in some form in the LM's parameters. After training, some of the stored facts can be recovered from the LM's parameters by means of a suitable natural language query (Fig. 1). A LM with such a "built-in" KB is useful for knowledge-intensive tasks (Petroni et al., 2020) and question answering (Roberts et al., 2020), and could improve natural language interfaces to structured data (Hendrix et al., 1978; Herzig et al., 2020). However, this emerging LM-as-KB paradigm faces several foundational questions.
First question: KBs contain millions of entities, while LM vocabulary size usually does not exceed 100k entries. How can millions of entities be represented in LMs? Petroni et al. (2019) circumvent this problem by only considering 21k entities whose canonical name corresponds to a single token in the LM's vocabulary, e.g., entities like "France" or "Bert", but not "United Kingdom" or "Sesame Street". Hence, this approach cannot handle entities not contained in the vocabulary, and a query like "Bert is a character on [MASK]" is not answerable in this simplified setting. To answer this first question, we compare three methods for scaling LM-as-KB to millions of entities: symbolic, surface form, and continuous entity representation. Symbolic representation allows the most accurate storage, but is computationally expensive and requires entity-linked training data. Surface form representation is computationally efficient and does not require entity-linked data, but is less accurate, especially for longer entity names.
Continuous representation also requires entity-linked data, but is computationally more efficient than symbolic representation.
Second question: What is the capacity of LMs for storing world knowledge? Can a LM store, say, all relation triples contained in a KB like Wikidata (Vrandečić and Krötzsch, 2014)? Here we conduct experiments using synthetic data to study the scaling behaviour of current LM architectures. Varying the number of trainable model parameters and recording the number of relation triples memorized at a given accuracy level, we find that, e.g., a Transformer (Vaswani et al., 2017) with 125 million parameters (12 layers of size 768) has the capacity to memorize 1 million Wikidata relation triples with 95 percent accuracy, or 5 million relation triples with 79 percent accuracy. Assuming linear scaling, this finding suggests that larger LMs with tens or hundreds of billions of parameters (Raffel et al., 2019; Brown et al., 2020) can be used to store sizable parts, if not all, of a KB like Wikidata.
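Assuming the linear scaling noted above, a quick back-of-the-envelope sketch makes the extrapolation concrete (the 175B parameter count is an illustrative stand-in for the large LMs cited, not a configuration evaluated here):

```python
# Linear-scaling extrapolation: 125M parameters memorize 1M triples at
# 95 percent accuracy; estimate what a much larger model could hold.
params_small = 125e6          # parameters of the measured Transformer
triples_small = 1e6           # triples memorized at 95% accuracy
triples_per_param = triples_small / params_small

params_large = 175e9          # illustrative large-LM parameter count
estimated_triples = params_large * triples_per_param
```

Under this (strong) linearity assumption, such a model would store on the order of a billion triples, i.e., a sizable part of Wikidata.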
Third question: How robustly is world knowledge stored in LMs? Is the LM able to recall a fact even if the query is slightly different from what was memorized during training? For example, if the LM memorized "Barack Obama was born in Hawaii" during training, can it answer queries like "Barack Obama is from [MASK]" or "Where was Barack Obama born? [MASK]"? Here we conduct experiments to measure how well the LM transfers knowledge from memorized statements to query variants, both in a zero-shot setting, in which the model is not exposed to the target query variant during training, and a few-shot setting, in which the model is finetuned on a small number of statements containing the target query variant. We observe zero-shot transfer in case of highly similar query variants, and see successful few-shot transfer after finetuning with 5 to 100 instances in case of less similar queries. This ability to handle soft, natural language queries, as opposed to hard, symbolic queries in a language like SQL or SPARQL, is one of the key motivations for using LMs as KBs.
Contributions. We formulate two requirements for treating LMs as KBs: (i) the ability to store a large number of facts involving a large number of entities and (ii) the ability to query stored facts. After providing background on world knowledge in LMs (§2), we make the following contributions:
• A comparison of entity representations for scaling LM-as-KB to millions of entities (§3);
• Empirical lower bounds on LM capacity for storing world knowledge facts (§4); and
• A controlled study of knowledge transfer from stored facts to paraphrased queries (§5).

World Knowledge in Language Models
Large pretrained LMs have been the driver of recent progress in natural language processing (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2019; Devlin et al., 2019). While the trend towards larger LMs is likely to continue (Raffel et al., 2019; Kaplan et al., 2020; Brown et al., 2020), it has limitations: (i) A model trained only on text lacks grounding in perception and experience and hence cannot learn meaning (Bender and Koller, 2020). (ii) Reporting bias leads to certain knowledge rarely or never being expressed in text. For example, a LM will easily learn to associate the phrase "Barack Obama" with the phrase "U.S. President", but is less likely to learn that he is a "human being", since the latter fact is rarely stated explicitly in text. In contrast, this type of knowledge is readily available in KBs. (iii) A large number of rare entities (Hoffart et al., 2014; Derczynski et al., 2017; Ilievski et al., 2018) are, by definition, rarely mentioned, making it difficult for LMs to acquire knowledge about this long tail of entities from text alone. These limitations have motivated efforts to explicitly 2 equip LMs with world knowledge. Table 2 (Appx. A) situates these efforts on a spectrum from purely text-based LMs to representations of structured KBs. Models based on text generation (Raffel et al., 2019; Roberts et al., 2020) and retrieval (Guu et al., 2020) have proven most successful in knowledge-intensive tasks. However, we argue that models which reify entities, i.e., models in which entities are "first-class citizens" that can be directly predicted 3, are a promising research direction, since the direct links into a KB can be seen as a form of grounding. This is one of our main motivations for considering symbolic and continuous entity representations.

Entity Representations
How can millions of entities be represented in a LM? To answer our first question, we compare three types of entity representations: symbolic, surface form, and continuous.
Experimental setup. We evaluate entity representations by measuring how well they allow a LM to store and retrieve world knowledge facts. For example, if the LM's training data contains the statement "Bert is a character on Sesame Street", the model should memorize this statement and recall the correct object Sesame Street when given a query like "Bert is a character on [MASK]."
Synthetic data. It is not a priori clear how many facts a text from the LM's training data, say, a Wikipedia article, expresses. Since we want to precisely measure how well a LM can store and retrieve facts, we create synthetic data by generating statements from KB relations and then train the model to memorize these statements. Using Wikidata as KB, we first define two sets of entities: a smaller set consisting of the top 1 million Wikidata entities according to node outdegree, and a larger set consisting of the roughly 6 million Wikidata entities that have an entry in the English Wikipedia.
Next, we manually create templates for the 100 most frequent Wikidata predicates. For example, for the predicate P19 ("place of birth"), we create the template "S was born in O" and generate English statements by filling the S and O slots with entities from the sets defined above for which this relation holds. To make queries for an object unique given subject and predicate, we arbitrarily select exactly one fact if there are multiple objects and discard the other facts. This process yields 5 million statements involving up to 1 million entities, and 10 million statements involving up to 6 million entities. These statements then serve as training instances, i.e., given the query "Barack Obama was born in [MASK]", the model should predict Hawaii. As our goal is to store facts in a LM, there is no distinction between training and test data.
2 As opposed to the LM acquiring world knowledge implicitly as a side effect of its training objective.
3 As opposed to generating or retrieving a surface form which may or may not correspond to an entity.
Models and training. We consider two common LM architectures: LSTMs (Hochreiter and Schmidhuber, 1997) and Transformers (Vaswani et al., 2017). For LSTMs, we compare two configurations: a randomly initialized two-layer LSTM with layer size 256 (LSTM 256) and one with layer size 1024 (LSTM 1024). For Transformers, we compare a pretrained RoBERTa-base (Liu et al., 2019) and RoBERTa without pretraining, i.e., a randomly initialized Transformer of the same size. For consistent tokenization across all four models, we subword-tokenize statements with the RoBERTa tokenizer. To store statements in a LM, we train until the model reaches 99 percent memorization accuracy, i.e., overfits the training data almost perfectly, or stop early if accuracy does not improve for 20 epochs. See Appx. D for training details.
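The uniqueness filter described above, which keeps exactly one object per subject-predicate pair, can be sketched as follows (the triples are toy examples, not the sampled Wikidata data):

```python
# Keep one arbitrary object per (subject, predicate) pair so that every
# query "S P [MASK]" has a unique answer.
triples = [
    ("Barack Obama", "educated at", "Harvard Law School"),
    ("Barack Obama", "educated at", "Columbia University"),
    ("Albert Einstein", "place of birth", "Ulm"),
]

unique = {}
for s, p, o in triples:
    unique.setdefault((s, p), o)   # first object seen wins; others are discarded

filtered = [(s, p, o) for (s, p), o in unique.items()]
```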

Symbolic Representation
With symbolic representation, each entity is represented as an entry in the LM's vocabulary. Prediction is done via masked language modeling (Devlin et al., 2019): the query is encoded with the LM, the final hidden state of the [MASK] token is projected onto the vocabulary, and a softmax is taken over the vocabulary. As the results show (Fig. 2), symbolic representation yields very high memorization accuracies with a vocabulary of 1 million entities. Randomly initialized RoBERTa-base without pretraining works best, memorizing 97 percent of 5 million statements correctly. Unfortunately, the softmax computation becomes prohibitively slow as the vocabulary size increases (Morin and Bengio, 2005), making symbolic representation with a softmax over a vocabulary consisting of the full set of 6 million Wikipedia entities impractical. Imposing a hierarchy is a common approach for dealing with large vocabularies, but did not work well in this case (see Appx. F.1).
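A minimal sketch of this prediction step, with toy sizes and random weights rather than the actual model (with millions of entities, the projection matrix and softmax below are exactly what becomes prohibitively expensive):

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, hidden = 1000, 64    # toy stand-ins; the paper scales to 1M entities

# Projection from the LM's hidden space onto the entity vocabulary.
W = rng.normal(size=(hidden, n_entities))

def predict_entity(mask_hidden_state):
    """Project the [MASK] hidden state onto the vocabulary, softmax, argmax."""
    logits = mask_hidden_state @ W
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

h = rng.normal(size=hidden)                       # encoder output at [MASK]
entity_id, probs = predict_entity(h)
```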

Surface Form Representation
With surface form representation, each entity is represented by its canonical name. 5 Since this name generally consists of more than one token, we cast memorizing statements and querying facts as a sequence-to-sequence task (Sutskever et al., 2014): Given the source sequence "Bert is a character on [MASK]", the model should generate the target sequence "Sesame Street". 6 To make models memorize statements, we train until perplexity on the training data reaches 1.0 or does not improve for 20 epochs. For evaluation, we generate the target sequence, i.e., the answer to a given query, via a beam search with beam size 10. We measure perfect-match accuracy of the full entity name, i.e., there is no credit for partial token matches.
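The perfect-match metric can be sketched in a few lines (the predictions below are invented for illustration):

```python
def perfect_match_accuracy(predictions, golds):
    """Exact string match of the full generated entity name; no partial credit."""
    assert len(predictions) == len(golds)
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

preds = ["Sesame Street", "Sesame", "United Kingdom"]
golds = ["Sesame Street", "Sesame Street", "United Kingdom"]
acc = perfect_match_accuracy(preds, golds)   # "Sesame" alone earns no credit
```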
The four models under comparison are now treated as sequence-to-sequence encoders and extended with a decoder of the same size: LSTM decoders for LSTM encoders (LSTM2LSTM) and randomly initialized Transformers for Transformer encoders (RoBERTa2Transformer, Transformer2Transformer).
Unlike symbolic representation, surface representation can handle the entire set of 6 million Wikipedia entities. As with symbolic representation, the randomly initialized Transformer (Fig. 3, dash-dotted red line) has the highest capacity, memorizing 10 million statements with 90 percent accuracy.
A pretrained encoder (RoBERTa2Transformer) appears to have a deleterious effect, yielding lower accuracies than the randomly initialized Transformer2Transformer. While the larger LSTM2LSTM (layer size 1024) almost matches the performance of the best Transformer model, the smaller one (layer size 256) has insufficient capacity, memorizing less than 50 percent of 5 million statements. Analysis of the Transformer2Transformer model (Fig. 4) reveals, perhaps unsurprisingly, that statements involving infrequent, long entity mentions are difficult to memorize. For example, the model fails to memorize most entity mentions that occur only in one to ten statements and have a length of 12 or more subwords (blue cluster, upper left).
5 We use English Wikidata labels as canonical names.
6 The [MASK] token is included since the target entity does not always occur at the end of a statement.

Continuous Representation
With continuous representation, an entity e_i, i ∈ [1, N_entities], is represented by a d-dimensional embedding y_i ∈ R^d. After encoding a query with the LM, prediction is performed by projecting the final hidden state corresponding to the [MASK] token onto R^d, obtaining the predicted embedding ŷ ∈ R^d. We use fixed, pretrained entity embeddings and train with the cosine loss L = 1 − cos(ŷ, y_i). At test time, the model prediction ŷ is mapped to the closest pretrained entity embedding y_i via nearest-neighbor search (Johnson et al., 2017).
Continuous prediction with fixed, pretrained embeddings. When training randomly initialized embeddings with a similarity objective, a degenerate solution is to make all embeddings the same, e.g., all-zero vectors. To prevent this, it is common practice to use negative samples (Bordes et al., 2013). When using fixed, pretrained embeddings as supervision signal, negative sampling is not necessary, since the target embeddings are not updated and therefore cannot become degenerate.
Wikidata embeddings. We train embeddings for 6 million Wikidata entities using feature-specific autoencoders to encode entity features such as names, aliases, descriptions, entity types, and numeric attributes, following prior work on multi-modal KB embeddings (Pezeshkpour et al., 2018) and KB embeddings with autoencoders (Takahashi et al., 2018). Embedding training is detailed in Appx. E.
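The training objective and the nearest-neighbor decoding step can be sketched as follows (random toy embeddings; at the scale of millions of entities the exhaustive search below would be replaced by an approximate index, per Johnson et al., 2017):

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, d = 500, 64                      # the paper uses d = 64 for 6M entities

# Fixed, pretrained entity embeddings, normalized to the unit hypersphere.
E = rng.normal(size=(n_entities, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)

def cosine_loss(y_hat, y):
    """L = 1 - cos(y_hat, y); fixed targets make negative sampling unnecessary."""
    return 1.0 - (y_hat @ y) / (np.linalg.norm(y_hat) * np.linalg.norm(y))

def decode(y_hat):
    """Map the predicted embedding to its nearest entity (exhaustive search)."""
    return int(np.argmax(E @ (y_hat / np.linalg.norm(y_hat))))

gold = 42
y_hat = E[gold] + 0.01 * rng.normal(size=d)  # a near-perfect model prediction
loss = cosine_loss(y_hat, E[gold])
```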
Results. Fig. 5 shows memorization accuracies achieved with continuous representation. Like surface representation, continuous representation scales to 6 million entities, and we see the same relative order of models, but with overall lower accuracies. RoBERTa without pretraining has the highest capacity for storing world knowledge statements, memorizing 67 percent of 10 million statements, while the small LSTM 256 model has the lowest capacity, memorizing 42 percent. Although far from fully understood, sequence-to-sequence architectures are relatively mature, with highly optimized toolkits and hyperparameter settings publicly available. In contrast, prediction of continuous representations is still in an early stage of research (Kumar and Tsvetkov, 2019). We therefore see these results as lower bounds for LM capacity with continuous representations. By design, memorization with continuous representations does not rely on entity names and hence, in contrast to surface form representation, does not lead to difficulties in handling entities with long names. However, as with surface form representation, infrequent entities are more difficult to memorize than frequent ones. Most of the memorization errors (Fig. 6, blue, left) involve infrequent entities with a median frequency of 3, while most of the correctly memorized statements (orange, right) involve entities that occur more than 100 times.

LM Capacity for Storing Facts
We now turn to the second question: how does model capacity scale with model size (Fig. 7, top)? With a 12-layer Transformer of layer size 96 or 192 (top subfigure, solid red and dashed green lines), memorization accuracy quickly drops as the number of facts to memorize increases. Larger models can memorize more facts, but accuracy still drops rather quickly, e.g., to 65 percent of 3 million facts memorized with a layer size of 384 (dotted orange line).
Assuming a desired memorization accuracy of 80 percent, we record the maximum number of facts a model of a given size can memorize at this level (Fig. 7, bottom). For the model sizes considered here, storage capacity appears to scale linearly, with a model of layer size 384 (55M parameters) storing one million facts and a model of layer size 960 (160M parameters) storing 7 million facts.
Apart from the number of facts to store, we hypothesize that successful storage depends on two more factors: the number of entities and the entropy of their distribution. As expected, a large number of entities makes memorization more difficult (Table 1). The number of entities has a small effect with surface representation (2 percent drop), but with continuous representation accuracy drops from 85 percent to 79 percent when the number of entities increases from 1 to 6 million. We also observe an impact of the entity distribution (Appx. G), but leave detailed analysis to future work.

Storage Efficiency
Our comparison of different entity representations (§3) does not control for the number of trainable model parameters. That is, we selected common architectures, such as a Transformer with 12 layers of size 768, but made no effort to ensure that, e.g., the number of trainable parameters introduced by the softmax layer in a model with symbolic representation matches the number of trainable parameters introduced by the addition of a sequence-to-sequence decoder component in a model with surface form representation. In order to more fairly compare entity representations across models with differing numbers of trainable parameters, we formulate the storage efficiency of a model designed to memorize statements:
Storage efficiency = (#statements × accuracy) / #parameters
This measure expresses the intuition that a model is efficient if it requires few parameters to memorize a large number of statements with high accuracy. When quantifying efficiency with this measure, we find that continuous representation is the most efficient (Figure 8) and hence use this form of entity representation in the remainder of this work.
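As a worked example, plugging in two points from the 80-percent-accuracy capacity estimates of §4 (55M parameters storing 1M facts, 160M parameters storing 7M facts):

```python
def storage_efficiency(n_statements, accuracy, n_parameters):
    """Statements memorized, weighted by accuracy, per trainable parameter."""
    return n_statements * accuracy / n_parameters

eff_small = storage_efficiency(1_000_000, 0.80, 55_000_000)    # layer size 384
eff_large = storage_efficiency(7_000_000, 0.80, 160_000_000)   # layer size 960
```

Under this measure, the larger of the two configurations is the more efficient one.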

Querying Stored Facts
So far, we saw that it is possible to store millions of facts in a LM by finetuning the model to predict the masked object of statements like "Barack Obama was born in [MASK]". However, given the large number of model parameters and the training effort, mere storage is not a compelling achievement: The underlying relation triple, here (Barack Obama, wasBornIn, Hawaii), can be stored more compactly and with perfect accuracy in a structured KB. One of the potential benefits of the LM-as-KB paradigm is the LM's ability to handle paraphrases. If the LM's representation of the statement above is sufficiently similar to its representation of queries like "Barack Obama is from [MASK]" or "Where is Barack Obama from? [MASK]", this similarity could allow transfer from the memorized statement to these unseen queries. Is this soft querying of facts stored in a LM possible? We now conduct a controlled experiment to answer this question, expecting one of the following three outcomes:
1. Rote memorization. The model memorizes statements with little or no abstraction, so that even small, meaning-preserving changes to the query prevent the model from recalling the correct object.
2. Generic association. The model memorizes pairs of subject and object entities but disregards the predicate. For example, a model might predict Hawaii whenever the query contains the phrase Barack Obama, regardless of context. This pathological behaviour could be especially prevalent if the distribution of object entities co-occurring with a given subject is dominated by one object.
3. Fact memorization. The model memorizes facts expressed in statements by forming abstractions corresponding to entities and predicates. This allows retrieving a fact with a variety of queries.
Sections 3 and 4 already established that a model of sufficient size can perform rote memorization of millions of statements. We now design an experiment to test whether LMs are capable of fact memorization, while taking care to distinguish this capability from generic association. Concretely, our goal is to test if a LM that has memorized a statement like "Barack Obama was born in Hawaii" can use this knowledge to answer a query like "Barack Obama is from [MASK]". Conveniently, wasBornIn relations are among the most frequent in Wikidata and hold for a diverse set of subject and object entities. This diversity of entities makes this predicate a good candidate for our case study, since statements involving a predicate with a less diverse set of subject or object entities are easier to memorize.
Statements and controls. We sample 100k statements generated by the template "S was born in O". To allow distinguishing whether a model that memorizes these 100k facts does so by generic association or by fact memorization, we introduce control facts. Given a fact (S, P, O), its control (S, P', O') involves the same subject S, but a distinct predicate P' and object O'. For example, a control for the fact (Albert Einstein, wasBornIn, Ulm) is the fact (Albert Einstein, diedIn, Princeton). We add 100k control statements generated from the template "S died in O" and train RoBERTa-base to memorize all 200k statements with 98 percent accuracy. The combination of statements and controls counters generic association: To correctly answer the query "Albert Einstein died in [MASK]", the model needs to take the predicate into account, since two distinct objects are associated with Albert Einstein.
Query variants. Next, we collect query variants, such as "S is from O" (row labels in Fig. 9, top). Expecting good transfer for variants that are similar to the original statement, we include variants with small changes, such as varying punctuation.
As more diverse variants, we select frequent relation patterns, such as "S (b. 1970, O)", from the GoogleRE Corpus (Google, 2013), as well as a query in question form and queries with irrelevant or misleading distractors such as "S was born in O, but died somewhere else". For each variant, we generate 100k queries by filling the S and O slots with the same entity pairs as the original statements.
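Statement, control, and query-variant generation from templates can be sketched as follows (toy entity pairs; the paper fills each template with 100k Wikidata entity pairs):

```python
def generate(template, pairs):
    """Fill the S and O slots of a template for each (subject, object) pair."""
    return [template.format(S=s, O=o) for s, o in pairs]

born = [("Albert Einstein", "Ulm"), ("Barack Obama", "Honolulu")]
died = [("Albert Einstein", "Princeton")]

statements = generate("{S} was born in {O}", born)   # memorized during training
controls = generate("{S} died in {O}", died)         # counter generic association
queries = generate("{S} is from {O}", [(s, "[MASK]") for s, _ in born])  # variant
```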
To balance statements and controls, we create control templates (row labels in Fig. 9, bottom) and generate a matching number of control statements.
Transfer results. We evaluate knowledge transfer from memorized statements to query variants using RoBERTa-base (Fig. 9, top, left), measuring accuracy over the 100k statements generated with a target query variant template. To measure the effect of pretraining on transfer ability, we compare to RoBERTa-base without pretraining (Fig. 9, top, right). We consider zero-shot transfer without any finetuning towards the target query variant, and a finetuning setting, in which the LM is first trained to memorize all 100k original statements and then finetuned until it memorizes a small number of statements in the target query format. In the zero-shot setting (leftmost column), even small changes to the query lead to a drop in fact recall: Adding an ellipsis (4th row) causes the model to answer 95% of queries correctly, a 3% drop from the 98% memorization of the original statements (first row). Adding an exclamation mark (5th row) results in an 8% drop. For other paraphrases, e.g., "S, who is from O" (7th row) and "S is from O", zero-shot transfer works only in 35% and 20% of cases, and the question format (11th row) allows zero-shot transfer with 32% accuracy. For the remaining paraphrases, e.g., those with parentheticals or the distractor "died", zero-shot transfer is poor, with accuracies ranging from 3% to 13%.
A clear trend is visible: transfer works best for similar statements and worst for dissimilar ones. To quantify this trend, we compute a representation of a statement template by averaging over its 100k mean-pooled, LM-encoded statements, and then measure the Euclidean distance between the original template representation and the representation of a query variant template. Correlating Euclidean distance with the accuracy of zero-shot transfer yields a Pearson coefficient of −0.68, indicating a strong negative correlation. That is, transfer tends to work well for paraphrased queries the LM deems similar to the originally memorized statement, but fails if the LM's representation of a query is too dissimilar to its representation of the original statement. This trend is also reflected in the finetuning setting, with less similar variants requiring up to 500 instances until the model achieves 90 percent accuracy (last row), while transfer to more similar variants works well after finetuning on 5 to 50 target instances.
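The correlation analysis boils down to a Pearson coefficient between template distances and zero-shot accuracies; a sketch with invented stand-in numbers (not the paper's measurements):

```python
import numpy as np

# Invented illustration: distance of each query-variant template from the
# original template representation vs. its zero-shot transfer accuracy.
distances = np.array([0.0, 0.5, 1.1, 1.8, 2.4, 3.0])
accuracies = np.array([0.98, 0.95, 0.90, 0.35, 0.20, 0.05])

r = np.corrcoef(distances, accuracies)[0, 1]   # Pearson coefficient
```

With any data of this monotone shape, r comes out strongly negative, mirroring the −0.68 reported above.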
When using RoBERTa without pretraining to memorize statements, knowledge transfer to query variants is much worse. While transfer still works for the most similar variants (right, top rows), less similar variants require more finetuning compared to pretrained RoBERTa (right, middle rows). Transfer does not work for the least similar variants, with accuracies as low as 1 to 4 percent even after finetuning with 500 instances (right, bottom rows). Similar results for control statements are shown in Fig. 9 (bottom). We take these results as evidence that pretraining enables LMs to handle paraphrased queries and that LMs can memorize facts beyond mere rote memorization and generic association.

Limitations and Conclusions
Limitations. This work is not without limitations. We only use one KB in our experiments. Arguably, as the largest publicly available source of world knowledge, Wikidata is the most promising resource for equipping LMs with such knowledge, but attempts to store a KB with a different structure might yield different outcomes, since some types of graphs are easier for a LM to memorize than others (see Appx. G).
While we use language like "train a LM to memorize statements" for simplicity throughout this work, what we do in case of pretrained LMs is more akin to adaptive pretraining (Gururangan et al., 2020). It is possible that integrating entity supervision directly into LM pretraining (Févry et al., 2020) allows more efficient fact storage.
Our analysis was focused on entity representations and ignored the question of how to represent relation predicates or entire relation triples. Here, relation learning (Baldini Soares et al., 2019) and LM pretraining on fact-aligned corpora (Elsahar et al., 2018) are avenues for future work.
Finally, we formulated the LM-as-KB paradigm in terms of storing and retrieving relation triples. While structured KBs such as Wikidata consist of such triples, and hence our experiments showing storage and retrieval of triples in LMs are sufficient as a proof-of-concept in principle, structured KBs allow more complex queries than the ones considered here, such as 1-to-n relations, multi-hop inference, queries involving numerical ranges, or facts qualified by time and location (Hoffart et al., 2013).
Conclusions. We gave a positive answer to Petroni et al. (2019)'s question whether language models can serve as knowledge bases. Arguing that treating LMs as KBs requires representing a large number of entities, storing a large number of facts, and the ability to query a fact with a variety of queries, we showed that current LM architectures fulfill these requirements when extended with a component for representing entities. In addition to the ability to handle paraphrased queries, we envision further benefits from the LM-as-KB paradigm. For example, the fact-memorization and paraphrase-finetuning setting introduced in Section 5 allows precise control over which facts a LM learns.

Acknowledgments
We thank the anonymous reviewers for helpful feedback. This work was supported by a Google Focused Research Award.

E Embeddings of Wikidata entities

We train the embedding of a given Wikidata entity by collecting its features from Wikidata, encoding each feature to obtain a dense feature representation, and then concatenating the feature representations. For textual features, we use RoBERTa-base as encoder and train corresponding decoders in a standard sequence-to-sequence auto-encoding setup. For quantities, we select the 100 most common quantity types to obtain a fixed-size representation and then follow a standard auto-encoding setup. Similarly, we obtain a fixed-size entity type representation by selecting the 1000 most common entity types. The concatenated feature representations are then compressed to embedding size d using a separate autoencoder. Preliminary experiments with embedding sizes d ∈ {64, 128, 192, 256} showed similar memorization accuracies for all d, but faster convergence for smaller sizes. We set d = 64 in our main experiments.

F Things that didn't work

F.1 Hierarchical entity representation with binary codes
Since imposing a hierarchy is a common method for dealing with large vocabulary sizes (Morin and Bengio, 2005) in general, and large inventories of entities and entity types in particular (Raiman and Raiman, 2018;López et al., 2019), we created a hierarchy of all entities in Wikidata, using a given entity's position in this hierarchy as training signal. Specifically, we created the entity hierarchy by fitting a KD-tree (Bentley, 1975;Virtanen et al., 2020) with leaf size 1 over pretrained entity embeddings, thereby obtaining a binary partitioning of the embedding space in which each final partition contains exactly one entity embedding. The path from the KD-tree's root to a leaf can be represented as a binary code, which we use as training signal (Oda et al., 2017). Memorization accuracy of world knowledge facts with object entities represented in the form of these binary codes was substantially lower compared to the three approaches described in the main part of this work.
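A toy version of the binary-code construction (a hand-rolled median split rather than the scipy KD-tree used here, but producing the same kind of leaf-size-1 partition with one binary code per entity):

```python
import numpy as np

def kd_codes(embeddings, ids=None, code=""):
    """Recursively split points at the median of the widest dimension,
    recording left/right decisions as a binary code per entity index."""
    if ids is None:
        ids = list(range(len(embeddings)))
    if len(ids) == 1:
        return {ids[0]: code}
    pts = embeddings[ids]
    dim = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))  # widest spread
    order = sorted(ids, key=lambda i: embeddings[i, dim])
    mid = len(order) // 2
    codes = {}
    codes.update(kd_codes(embeddings, order[:mid], code + "0"))
    codes.update(kd_codes(embeddings, order[mid:], code + "1"))
    return codes

emb = np.random.default_rng(0).normal(size=(8, 4))  # 8 toy "entity embeddings"
codes = kd_codes(emb)                               # entity index -> binary code
```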

F.2 Training entity embeddings with negative sampling
Instead of using fixed, pretrained entity embeddings as training signal, we experimented with randomly initialized embeddings that are updated during training, using between 1 and 50 in-batch negative samples, which is a standard method in the knowledge base embedding literature (Bordes et al., 2013) and has been used successfully for entity retrieval (Gillick et al., 2019). However, compared to using fixed, pretrained entity embeddings without negative sampling, we observed lower memorization accuracies and slower convergence in our experiments.

F.3 Updating pretrained entity embeddings during training
Instead of using fixed entity embeddings, we tried updating them during training with in-batch negative sampling. This increased the number of trainable parameters, memory usage, and training time, but did not lead to higher memorization accuracies.

F.4 Continuous representation with Euclidean distance loss
Instead of normalizing entity embeddings to the unit hypersphere and training with cosine loss, we experimented with predicting the original pretrained entity embeddings and using the Euclidean distance as loss. Compared to using spherical entity embeddings as prediction targets, we observed slower convergence and lower memorization accuracies.
G Impact of graph type on memorizability

Figure 11: Impact of graph type on a model's ability to memorize the graph. We consider two types of random graphs, namely a uniform (Erdos-Renyi) graph and a scale-free (Barabasi) graph. We interpret graph edges as relation triples in a knowledge graph and train models to predict the relation object, given subject and predicate, until memorization accuracy reaches 99 percent. For a given number of model parameters, we gradually increase the number of relation triples to memorize and record the maximum number of relation triples memorized for this number of parameters. We compare an LSTM as well as a bilinear KB embedding (DistMult). For a given parameter budget, models are able to memorize more triples from an Erdos-Renyi graph (blue) than from a Barabasi graph, indicating that the latter is more difficult to memorize.