MLMLM: Link Prediction with Mean Likelihood Masked Language Model

Knowledge Bases (KBs) are easy to query, verifiable, and interpretable. They, however, scale with man-hours and high-quality data. Masked Language Models (MLMs), such as BERT, scale with computing power as well as with unstructured raw text data. The knowledge contained within those models is, however, not directly interpretable. We propose to perform link prediction with MLMs to address both the scalability issues of KBs and the interpretability issues of MLMs. To do so, we introduce MLMLM, Mean Likelihood Masked Language Model, an approach that compares the mean likelihood of generating the different entities to perform link prediction in a tractable manner. We obtain State of the Art (SotA) results on the WN18RR dataset and the best non-entity-embedding based results on the FB15k-237 dataset. We also obtain convincing results on link prediction for previously unseen entities, making MLMLM a suitable approach to introducing new entities to a KB.

Introduction

Context

KBs have many desirable properties. They are easy to query, verifiable, and, perhaps most importantly, human interpretable. They have, however, one critical shortcoming: they are expensive to build, which makes them hard to scale. Indeed, modern KBs scale with high-quality data, manual labor, or a mix of both. In contrast, approaches that scale with available computation and with the massive amounts of unstructured data being created and accumulated have proven invaluable in the recent deep learning boom.
Large pretrained MLMs have been shown to scale well with large amounts of unstructured text data as well as with computing power. They also have shown some interesting emergent abilities, such as the ability to perform zero-shot question answering (Radford et al., 2019). This ability implies that the model parameters contain a large amount of factual knowledge that it can leverage to answer a wide array of questions. That knowledge is, however, hardly interpretable by humans, as it is hidden within the hundreds of millions or even tens of billions of parameters of the language model.
In this paper, we are interested in exploiting MLMs for link prediction. Many attempts at leveraging language models to complete KBs already exist. They, however, either rely on handcrafted templates to query the model (Petroni et al., 2019), limiting the generalizability of the solution, or are intractable for any decently sized KB (Yao et al., 2019). They also generally cannot introduce new, previously unseen entities to a KB and therefore require human intervention to keep a KB up to date.

Motivation
By using MLMs to complete KBs, we can address both the scalability issue of KBs and the interpretability issue of MLMs, by committing the knowledge of the latter to an interpretable format in the former. The MLM can learn new knowledge from the large amounts of unstructured textual data continually added to the World Wide Web and then be used to continually complete and update the KB. This has the very desirable effect of making the link prediction approach scale with both computational power and the quantity of available unstructured data, both of which show no sign of slowing down.

Problem Definition
Simply put, we want to train an MLM to, given an entity and a relation, generate all entities completing the KB triplet.
Several technical obstacles had to be overcome to achieve proper link prediction with pretrained MLMs. The first is tractability: these models are extremely large and expensive to run, so link prediction must be performed with as few inference calls as possible.
The second has to do with the format of the MLMs' inference outputs. The length of the output needs to be known at inference time, making it hard to sample entities of varying lengths. Work like Petroni et al. (2019) is limited to single-token outputs, which serves well to probe the model for the presence of embedded knowledge but is not usable in practice for tasks such as link prediction. Any approach has to be able to sample an MLM for entities of varying lengths to have practical applications. Finally, the usage of MLMs opens the door to performing link prediction on unseen entities. Some capability of MLMs in this regard was previously demonstrated (Petroni et al., 2019). We show that our approach yields strong results on unseen entities of arbitrary lengths and that this direction should be explored further.

Contribution
Our main contributions are summarized here:
• We propose MLMLM, a mean likelihood method to compare the likelihood of different texts of different token lengths sampled from an MLM.
• We demonstrate the tractability of our approach, which no previous MLM-based model had achieved on the link prediction task.
• We achieve SotA results on the WN18RR benchmark and the best non entity-embedding based mean reciprocal rank on the FB15k-237 benchmark.
• We demonstrate that our approach can generalize reasonably well to previously unseen entities on both benchmarks.

Related Work

Masked Language Models

MLMs such as BERT are trained on noised text, using the transformer architecture (Vaswani et al., 2017) to reconstruct the original text from the noisy inputs. Those models incorporate enormous amounts of language knowledge and world knowledge within their weights. This lets them be further tuned on challenging NLU tasks with great success.
Following in the footsteps of BERT, several second-generation MLMs have been released. These models (Liu et al., 2019; Lan et al., 2020) show great improvements over BERT on downstream tasks. Among other improvements to the original training process, they were trained for much longer on much larger text corpora to achieve those results.
Because these models are based on the transformer encoder architecture, their output length is equal to their input length. This makes it challenging to sample text of arbitrary length from an MLM without knowing the length of the desired sample in advance.

Link Prediction
Link prediction is the task of finding all potential entities that are in a specific relation with another entity. A knowledge graph (KG) is composed of a set of entities E, a set of relations R, and a set of valid triplets (h, r, t) representing the head entity h, the relation r, and the tail entity t. By assigning a score to all possible triplets completing (h, r, ?) and (?, r, t), it is possible to rank all candidate entities and thus complete the missing links within a KG.
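The scoring-and-ranking formulation above can be sketched in a few lines. This is a minimal illustration with a hand-written toy score table standing in for a learned model; the function and variable names are ours, not from the paper:

```python
# Rank all candidate tails t for a query (h, r, ?) by descending score.
def rank_entities(score, entities, h, r):
    return sorted(entities, key=lambda t: score(h, r, t), reverse=True)

# Toy score table standing in for a learned scoring model.
toy_scores = {("dog", "hypernym", "animal"): 0.9,
              ("dog", "hypernym", "car"): 0.1}
score = lambda h, r, t: toy_scores.get((h, r, t), 0.0)

ranked = rank_entities(score, ["car", "animal"], "dog", "hypernym")
# "animal" outranks "car" for the query (dog, hypernym, ?).
```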

A Re-evaluation of Knowledge Graph Completion Methods

Recently, Sun et al. (2020) found that many SotA approaches to link prediction have used an inappropriate evaluation protocol. They showed that the evaluation protocol typically used in link prediction approaches assigns a perfect score to a constant output, by putting the correct entities on top during a tiebreaker. In essence, under this protocol, assigning a likelihood of 0 to all entities would yield a perfect reranking score, since the tiebreaker would place the target entity first. This was shown to yield very inflated scores for many neural-network-based link prediction approaches (Nathani et al., 2019; Vu et al., 2019; Nguyen et al., 2017), as several of them output a large number of tied scores for the various entities. Entity-embedding-based approaches (Balažević et al., 2019; Sun et al., 2019; Dettmers et al., 2018) do not suffer from this issue. Although we have found that our approach does not suffer from it either, despite not being an embedding approach, we use the random evaluation protocol proposed by Sun et al. (2020) for all evaluations and compare against approaches that used a similar protocol, to ensure the validity of the comparisons. This protocol is similar to the filtered setting (Bordes et al., 2013), with the difference that the rank among entities with tied scores is randomly assigned.
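The randomized, filtered protocol can be sketched as follows. This is our own illustrative implementation, assuming scores are given as a list indexed by entity; the function name and toy setup are not from the paper:

```python
import random

def filtered_random_rank(scores, target_idx, corrupted, rng):
    """Rank of the target entity under the protocol of Sun et al. (2020):
    known-correct (corrupted) entities are filtered out first, and entities
    whose score ties the target are ordered randomly instead of placing
    the target on top."""
    target = scores[target_idx]
    kept = [s for i, s in enumerate(scores)
            if i == target_idx or i not in corrupted]
    better = sum(1 for s in kept if s > target)
    tied = sum(1 for s in kept if s == target) - 1  # ties, excluding the target
    return better + rng.randint(0, tied) + 1

# A constant scorer no longer earns a perfect score: with 5 all-tied
# entities the rank is uniform over 1..5, so its mean is about 3, not 1.
ranks = [filtered_random_rank([0.0] * 5, 0, set(), random.Random(seed))
         for seed in range(2000)]
mean_rank = sum(ranks) / len(ranks)
```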

KG-BERT
KG-BERT (Yao et al., 2019) is an approach to KB tasks based on MLMs. It successfully demonstrates the potential of leveraging those models' internal knowledge for KB tasks. The authors train a BERT model to classify whether an individual triplet fed to the model is correct or not. In essence, they feed every single possible (h, r, ?) and (?, r, t) triplet to the model to obtain all scores to be reranked. This can result in millions of inference steps on the MLM for a single triplet completion. In contrast, our approach requires only one inference step through the MLM for every triplet completion, by generating all logits required to obtain the likelihood of any potential entity. A comparison of the evaluation time is pictured in Figure.

Methodology

Overview

Our system performs link prediction. It uses an MLM to generate, in a single pass, the logits of all tokens required to rebuild any entity, and mean likelihood sampling to rerank all possible entities and perform the task. It can also be used to sample likelihoods for previously unseen entities. The system overview is shown in Figure 3: the inputs are passed to the trained language model to generate a lookup table; the ranking system then uses this lookup table to assign a score to entity tokens based on their likelihood; these scores are finally used to rank the entities, the highest-scoring ones being the best candidates to complete the link.

Data Pre-processing
The data pre-processing pipeline takes a link prediction dataset and transforms it into a generic format usable by the model. It is required that both entities and relations have string representations. For every entity in the dataset, we extract an entity string, which uniquely identifies the entity, and a definition string, which is a textual description of the given entity. For every relation, we extract a relation string, which uniquely identifies and describes the relation.
We tokenize all strings with the pretrained RoBERTa tokenizer (Sennrich et al., 2016) and further transform the entity string by adding padding to match the longest tokenized entity within the dataset. Concretely, in a dataset where the longest entity has a length of 4 token ids, the entity string "dog" would be padded to the representation "dog <pad> <pad> <pad>" and the entity string "cat and dog" to the representation "cat and dog <pad>", where "<pad>" is the padding token. The purpose of this padding is to give all entities the same masked representation, therefore letting the model treat all entities in the same manner.

Figure 1: Ranking System. The figure details the inner workings of the ranking system, which uses the lookup table generated by the masked language model to compute the score associated with each possible entity. The scored entities are then ranked by highest score.

Our approach uses the RoBERTa-Large model (Liu et al., 2019) for all experiments. We finetune the pretrained model on the link prediction datasets to generate the logits of the unknown entities. As our approach makes a single call to the model to rerank all possible entities, it is acceptable to use the larger model for better performance. Figure 4 shows the inference process for tail entity prediction; head entity prediction would instead take as input the head entity mask, the relation, the tail entity, and the tail entity definition. We use the relation string, the known entity string, and the definition of the known entity to make the model generate the logits representing the unknown entity string.
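The entity padding described in the data pre-processing step can be sketched as follows. This is a simplified illustration operating on word-level tokens with a literal "<pad>" marker; the real pipeline works on tokenizer ids and the tokenizer's own padding token:

```python
PAD = "<pad>"  # stand-in for the tokenizer's padding token

def pad_entities(tokenized_entities):
    """Pad every tokenized entity to the length of the longest one, so that
    all entities share a single masked representation of fixed length."""
    max_len = max(len(toks) for toks in tokenized_entities)
    return [toks + [PAD] * (max_len - len(toks)) for toks in tokenized_entities]

padded = pad_entities([["dog"], ["cat", "and", "dog"]])
# "dog" gains two padding tokens; "cat and dog" is already the longest entity.
```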

Ranking System
The ranking system pictured in Figure 1 performs link prediction on a given triplet. The MLM outputs logits for all possible token ids and positions of the missing entity completing the triplet. These outputs form the lookup table T. The link prediction dataset contains a list of all possible entities; the token ids forming those entities make up E. We obtain the entity token logits L by matching all token ids in E with their corresponding values in T. L represents how likely every token of the entity was to be generated by the MLM at that specific position. The mean likelihood of each entity is computed by averaging L over the non-padded token logits. This value determines the ranking of the entity and provides a proper comparison between entities of different lengths. Concretely, in our previous "cat and dog" example, we average the outputted logits for the "cat and dog" token ids and positions while ignoring the final padded logit. This averaging is done for every entity in the dataset completing the triplet, yielding the average likelihood assigned by the model to each entity. Entities are then sorted by highest score using the randomized setting (Sun et al., 2020), meaning that ties are broken randomly, to produce the ordered list of ranked entities R. We use the filtered setting (Bordes et al., 2013) for evaluation and remove corrupted triplets from the list of ranked entities, corrupted triplets being all other known correct triplets.
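The scoring step above can be sketched as follows. This is a toy illustration of the lookup idea, assuming the table holds one row of logits per mask position; the padding id, table values, and function name are ours, not from the paper:

```python
PAD_ID = 0  # assumed padding token id for this toy example

def mean_likelihood(lookup, entity_ids):
    """Average the model's output logit for each of the entity's own token
    ids at its own position, skipping padded positions. One forward pass
    yields the lookup table; every candidate entity is scored against it."""
    vals = [lookup[pos][tok] for pos, tok in enumerate(entity_ids)
            if tok != PAD_ID]
    return sum(vals) / len(vals)

# Toy lookup table T: 3 mask positions x a vocabulary of 4 token ids,
# standing in for the MLM's output logits.
T = [[0.0, 2.0, 1.0, 0.5],
     [0.0, 0.5, 3.0, 0.5],
     [0.0, 1.0, 1.0, 1.0]]
cat_and_dog = [1, 2, 3]         # fills all three positions
dog = [3, PAD_ID, PAD_ID]       # padded to the maximum entity length
scores = {"cat and dog": mean_likelihood(T, cat_and_dog),
          "dog": mean_likelihood(T, dog)}
```

Averaging over only the non-padded positions is what makes entities of different token lengths comparable: a short entity is not penalized for the padding it carries.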

Datasets
The two datasets used are WN18RR and FB15k-237 (Bordes et al., 2013; Toutanova and Chen, 2015; Dettmers et al., 2017; Fellbaum, 1998; Bollacker et al., 2008), two commonly used link prediction benchmarks. Summary stats for both are shown in Table 1. WN18RR is a dataset composed of WordNet synsets. We use the cleaned synset as the entity string: the synset "dog.n.01" has the string representation "dog noun 1", which should be more interpretable by the model while remaining a unique identifier. The entity definition is the definition of the entity given by WordNet. The relation string is the cleaned relation: the relation "_member_of_domain_usage" is represented by the string "member of domain usage". Full examples of inputs and outputs are shown in Listing 1 and Listing 2.
FB15k-237 is composed of triplets found in the now-defunct FreeBase KB, limiting itself to entities appearing in at least 100 triplets. We use the entity strings and definitions as defined in Xie et al. (2016). We clean the relation strings to include only the words.

Metrics
We use the Mean Reciprocal Rank (MRR) metric to validate our model and select the best model. For all experiments, we also report the Mean Rank (MR), the Mean Precision at 1 (MP@1), the Mean Precision at 3 (MP@3), and the Mean Precision at 10 (MP@10).
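These metrics can all be computed from the rank of the correct entity in each test query. A minimal sketch, using our own function name:

```python
def mrr_and_mp(ranks, ks=(1, 3, 10)):
    """MRR and MP@k (the fraction of queries whose correct entity is
    ranked within the top k), given the rank of the correct entity for
    each query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    mp = {k: sum(1 for r in ranks if r <= k) / len(ranks) for k in ks}
    return mrr, mp

mrr, mp = mrr_and_mp([1, 2, 10])
# MRR = (1 + 1/2 + 1/10) / 3; MP@1 = 1/3, MP@3 = 2/3, MP@10 = 1.0
```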

Training
The training setup is a modified MLM training, where we let the model generate the missing entity. The previously mentioned padding lets us deal with the generation of entities of varying sizes. The input fed to the model for tail entity prediction, pictured in Figure 4, consists of the concatenated token ids of the head entity, the head entity definition, the relation and the tail entity mask. The model will then generate, in the place of the mask, the missing entity. The input fed to the model for head entity prediction is similar. An example of the input for head entity prediction is found in Listing 1 and an example for tail entity prediction is found in Listing 2.
We use the categorical cross-entropy loss to train the language model. The loss depends only on the non-padded tokens of the generated entity, ignoring all other outputs. The target is the actual entity completing the triplet, aligned with the mask in the input. We retain the model with the best validation MRR. All experiments are run with 5 random seeds, and the mean and standard deviation of the results are reported.
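The padding-aware loss can be sketched as follows. This is a plain-Python illustration of cross-entropy restricted to non-padded target positions; in practice this would be the framework's cross-entropy with padding positions masked out:

```python
import math

PAD_ID = 0  # assumed padding token id

def masked_cross_entropy(logits, target_ids):
    """Categorical cross-entropy over the generated entity only: positions
    whose target token is padding contribute nothing to the loss."""
    total, count = 0.0, 0
    for pos_logits, tgt in zip(logits, target_ids):
        if tgt == PAD_ID:
            continue
        # log-softmax of the target token: log(sum exp) - logit[target]
        log_z = math.log(sum(math.exp(x) for x in pos_logits))
        total += log_z - pos_logits[tgt]
        count += 1
    return total / count

# Confident, correct predictions at the two non-padded positions -> small loss;
# the third (padded) position is ignored entirely.
loss = masked_cross_entropy([[0.0, 10.0, 0.0],
                             [0.0, 0.0, 10.0],
                             [9.0, 0.0, 0.0]],
                            [1, 2, PAD_ID])
```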
For all experiments, we use the hyperparameters and training setup described in Liu et al. (2019) and shown in Table 3, with a total of 10 epochs for the FB15k-237 dataset and 25 epochs for the WN18RR dataset.

Unseen Entities
A secondary version of each dataset is built to test the generalization capacity of our methodology to unseen entities. For both datasets, we start by randomly sampling 5% of the entities as validation entities and 5% as testing entities. Our training set consists of all triplets not containing any of the validation or testing entities. Our validation set consists of all triplets containing the validation entities. Finally, our test set consists of all triplets containing the test entities but not containing any of the validation entities. Training is done in the same fashion as before. Validation and testing are only done on the entities present in the validation or test entity list: if the tail entity is the one present in the test entity list, we complete the link (h, r, ?) and not the link (?, r, t). The reported results therefore reflect only the performance on entities previously unseen in the KB. The validation and test sets are rebuilt for every random seed, to evaluate our approach on a wider array of unseen entities.
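The split construction above can be sketched as follows. This is our own illustrative implementation under the stated protocol; the function name, `frac` parameter, and toy data are assumptions:

```python
import random

def unseen_split(triplets, entities, seed=0, frac=0.05):
    """Sample disjoint validation and test entity sets; train on triplets
    touching neither; validate on triplets containing validation entities;
    test on triplets containing test entities but no validation entities."""
    rng = random.Random(seed)
    shuffled = entities[:]
    rng.shuffle(shuffled)
    n = max(1, int(frac * len(entities)))
    valid_e, test_e = set(shuffled[:n]), set(shuffled[n:2 * n])
    held_out = valid_e | test_e
    train = [t for t in triplets
             if t[0] not in held_out and t[2] not in held_out]
    valid = [t for t in triplets if t[0] in valid_e or t[2] in valid_e]
    test = [t for t in triplets
            if (t[0] in test_e or t[2] in test_e)
            and t[0] not in valid_e and t[2] not in valid_e]
    return train, valid, test

triplets = [("dog", "hypernym", "animal"), ("cat", "hypernym", "animal"),
            ("dog", "meronym", "tail"), ("cat", "meronym", "tail")]
train, valid, test = unseen_split(triplets, ["dog", "cat", "animal", "tail"],
                                  seed=0, frac=0.25)
```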

WN18RR
We achieve SotA results on the WN18RR dataset on all tested metrics with the exception of MR, as shown in Table 2. The WN18RR dataset is sparse in terms of relations (see Table 1). This sparseness lends itself naturally to leveraging a pretrained model: since the amount of information that can be extracted from the dataset on any given entity is limited, outside information is all the more valuable. We can observe a large discrepancy between the MP@1 and MP@3 metrics, implying that the model has the correct answer in its top 3 much more often than at rank 1. This could be explained by an issue of disambiguation in the name of the entity. While approaches using entity embeddings (Balažević et al., 2019; Sun et al., 2019; Dettmers et al., 2018) have no issue separating the synsets dog.n.01 and dog.n.03, meaning respectively "a member of the genus Canis [...]" and "informal term for a man", our model has to discern between those two meanings only by the digit appended to the name. It is probable that the model is often confused about whether it should generate dog noun 1 or dog noun 3, having only the final digit to differentiate them. An example of such an error is shown in Listing 2, where the model confuses aid.n.01 and aid.n.03. Follow-up work on better representations for entity names could yield stronger results.
We performed quantitative and qualitative error analysis to understand some of the remaining shortcomings of our approach. Our model generally has a much easier time predicting the tail entity than the head entity, obtaining an MRR of 0.6015 on tail entities and an MRR of 0.4009 on head entities. Observing the instances where our model gives the worst rank to the correct answer shows why: a large number of those cases are hypernyms on the head entity. The definition of a hypernym is as follows: "A hypernym of something is its superordinate term: if X is a hypernym of Y, then all Y are X." (Fellbaum, 1998). An example of a hypernym relationship would be: "animal is a hypernym of dog, since all dogs are animals." Correctly ranking all possibilities for "X is a hypernym of dog." seems easier for the model than correctly ranking all possibilities for "Animal is a hypernym of Y.". An example of such a failure is shown in Listing 1, where we look for the hypernym of the term mediator. It is clear that the model understands the concept and outputs plausible answers in its top 5. A large share of the model's severe failure cases are similar to this one: the model outputs a plausible hypernym of the tail entity while completely missing the targeted hypernym.

Listing 1: Example of an error of the model on WN18RR. Shown are the top 5 ranked entities by the model with the score assigned to them. The correct answer, matchmaker noun 1, was ranked 14,108 by the system.

Listing 2: Example of a disambiguation error of the model on WN18RR. Shown are the top 5 ranked entities by the model with the score assigned to them. The correct answer, aid noun 3, was ranked second by the system, after aid noun 1.

The results are reported as <mean> ± <standard deviation>. Results for other models are taken from Sun et al. (2020).

FB15K-237
The results on FB15k-237, shown in Table 4, are fairly weak compared with those obtained on WN18RR. FB15k-237 is very dense and contains far more training examples than WN18RR for a smaller number of entities. Non-pretrained models thus have far more examples to learn from, which makes the learned information of pretrained models comparatively less impactful. This implies, unsurprisingly, that our approach relies heavily on the pre-training of the model and that it is less adept than other specialized approaches at learning from dense link prediction datasets.
However, FB15k-237 is an especially dense subset of the FreeBase dataset, being composed only of entities appearing in a minimum of 100 relations, and is thus not representative of the KB as a whole. In practice, KB completion will often be applied to entities rarely or never seen within the KB. While our FB15k-237 results are not SotA when compared to all approaches, the MRR compares favorably to all other non entity-embedding approaches under the randomized setting.

The results are reported as <mean> ± <standard deviation>.

Unknown Entities Experiments
We demonstrate the capacity of our approach to generalize to unknown entities. Results for the WN18RR and FB15k-237 datasets are shown in Table 5 and Table 6. As baselines, we use a random baseline, which ranks the entities randomly, as well as a non-finetuned RoBERTa-Large model, which generates the entity tokens without being finetuned on the dataset first. While our approach outperforms the non-finetuned benchmark, the non-finetuned RoBERTa model still far outperforms the random baseline, supporting the findings of Petroni et al. (2019) on the capacity of MLMs to perform unsupervised link prediction.
It is to be noted that the high standard deviation of the results in this set of experiments comes from the fact that the validation and test entities are resampled with a different random seed on every run, yielding more variability in the results.
We are unaware of other approaches that can generalize to unknown entities of arbitrary size in the task of link prediction. We believe that leveraging MLMs could eventually lead to automatically populating KBs with new entities, as new knowledge and new facts are created and added to the web.

Limitations
MLMLM comes with several limitations. Our approach to padding limits the size of an unknown entity to the size of the longest known entity. While this is unlikely to be limiting in practice, it remains a weakness of our approach to sampling. The model size can be prohibitive, and specialized hardware such as GPUs is required to run it in a timely fashion. The approach nonetheless remains tractable, as it provides likelihoods for all possible entities in a single inference call. Compared to entity-embedding based methods, our approach needs additional information in the form of meaningful string representations for both entities and relations. Entity disambiguation is also a limiting factor that does not affect other approaches.

Conclusion
We have developed a methodology for training masked language models to perform link prediction. By leveraging the natural language understanding abilities of these models as well as the factual knowledge embedded within their weights, we have achieved a tractable approach to link prediction that yields state-of-the-art results on a standard benchmark and the best non entity-embedding based results on another. We have also demonstrated the ability of our model to perform link prediction on previously unseen entities, making our approach suitable for introducing new entities to knowledge bases. More generally, we have introduced an approach to sampling text of varying lengths from a masked language model, which can have wider use cases.