OKGIT: Open Knowledge Graph Link Prediction with Implicit Types

Open Knowledge Graphs (OpenKGs) consist of (head noun phrase, relation phrase, tail noun phrase) triples such as (tesla, return to, new york) extracted from a corpus using OpenIE tools. While OpenKGs are easy to bootstrap for a domain, they are very sparse and far from directly usable in an end task. Therefore, the task of predicting new facts, i.e., link prediction, becomes an important step when using these graphs in downstream tasks such as text comprehension, question answering, and web search query recommendation. Learning embeddings for OpenKGs is one approach to link prediction that has received some attention lately. However, on careful examination, we found that current OpenKG link prediction algorithms often predict noun phrases (NPs) with incompatible types for given noun and relation phrases. We address this problem in this work and propose OKGIT, which improves OpenKG link prediction using a novel type compatibility score and type regularization. With extensive experiments on multiple datasets, we show that the proposed method achieves state-of-the-art performance while producing type compatible NPs in the link prediction task.


Introduction
An Open Knowledge Graph (OpenKG) is a set of factual triples extracted from a text corpus using Open Information Extraction (OpenIE) tools such as TEXTRUNNER (Banko et al., 2007) and ReVerb (Fader et al., 2011). These triples are of the form (noun phrase, relation phrase, noun phrase), e.g., (tesla, return to, new york). An OpenKG can be viewed as a multi-relational graph where the noun phrases (NPs) are the nodes, and the relation phrases (RPs) are the labeled edges between pairs of nodes. It is easy to bootstrap OpenKGs from a domain-specific corpus, making them suitable for newer domains. However, they are extremely sparse and may not be directly usable in an end task. Therefore, tasks such as NP canonicalization (merging mentions of the same entity) and link prediction (predicting new facts) become important steps in downstream applications. Some example applications are text comprehension (Mausam, 2016), relation schema induction (Nimishakavi et al., 2016), canonicalization (Vashishth et al., 2018), question answering (Yao and Van Durme, 2014), and web search query recommendation (Huang et al., 2016). In this work, we focus on improving OpenKG link prediction.
Although OpenKGs are structurally similar to Ontological KGs, they come with a different set of challenges: they are extremely sparse, NPs and RPs are not canonicalized, and no type information is available for NPs. There has been much work on learning embeddings for Ontological KGs in the past years; however, this task has not received much attention in the context of OpenKGs. CaRE (Gupta et al., 2019) is a recent method that addresses this problem. It learns embeddings for NPs and RPs in an OpenKG while incorporating NP canonicalization information. However, even after incorporating canonicalization, we find that CaRE struggles to predict NPs whose types are compatible with the given head NP and RP.
As observed by Petroni et al. (2019), modern pre-trained language representation models like BERT can store factual knowledge and can be used to perform link prediction in KGs. In our explorations with OpenKGs, we found that even though BERT may not predict the correct NP at the top, it predicts type compatible NPs (Table 1). A similar observation was also made in the context of entity linking. As OpenKGs do not have any underlying ontology and obtaining type information can be expensive, BERT predictions can help improve OpenKG link prediction.
Motivated by this, we employ BERT to improve OpenKG link prediction, using a novel type compatibility score (Section 4.2) and a type regularization term (Section 4.4). We propose OKGIT, a method for OpenKG link prediction with improved type compatibility. We test our model on multiple datasets and show that it achieves state-of-the-art performance on all of them.
We make the following contributions:
• We address the problem of OpenKG link prediction, focusing on improving the type compatibility of predictions. To the best of our knowledge, this is the first work that addresses this problem.
• We propose OKGIT, a method for OpenKG link prediction with novel type compatibility score and type regularization. OKGIT can utilize NP canonicalization information while improving the type compatibility of predictions.
• We evaluate OKGIT on link prediction across multiple datasets and observe that it outperforms the baseline methods. We also demonstrate that the learned model generates more type compatible predictions.
Source code for the proposed model and the experiments from this paper is available at https://github.com/Chandrahasd/OKGIT.

Related Work
OpenKG Embeddings: Learning embeddings for OpenKGs has been a relatively under-explored area of research. Previous work using OpenKG embeddings has primarily focused on canonicalization. CESI (Vashishth et al., 2018) uses KG embedding models for the canonicalization of noun phrases in OpenKGs. The problem of incorporating canonicalization information into OpenKG embeddings was addressed by Gupta et al. (2019). Their method for OpenKG embeddings (i.e., CaRE) performs better than Ontological KG embedding baselines in terms of link prediction performance. The challenges in the link prediction for OpenKGs were discussed in Broscheit et al. (2020), and methods similar to CaRE were proposed. In spirit, CaRE (Gupta et al., 2019) comes closest to our model; however, they do not address the problem of type compatibility in the link prediction task.
Entity Type: Entity typing is a popular problem where given a sentence and an entity mention, the goal is to predict explicit types of the entity. It has been an active area of research, and many models and datasets, such as (Mai et al., 2018), (Hovy et al., 2006), and (Choi et al., 2018), have been proposed. However, unlike this task, we aim to incorporate unsupervised implicit type information present in the pre-trained BERT model into OpenKG embeddings, rather than predicting explicit entity types present in ontologies or corpora.
For the unsupervised case, the problem of type compatibility in link prediction has been addressed by learning a type vector for each NP and two type vectors (head and tail) for each relation. This type compatibility score is multiplied with the triple score function, and the type vectors are trained jointly with the embedding vectors. Although this method addresses the type compatibility issue, it is based on Ontological KG embedding models and shares the same limitations. In another work (Xie et al., 2016), hierarchical type information available in the dataset is incorporated while learning embeddings. However, that model is suitable only for Ontological KGs where type information is readily available.
BERT in KG Embedding: The BERT architecture has been used for scoring KG triples (Yao et al., 2019; Wang et al., 2019). However, these methods work on Ontological KGs without any explicit attention to NP types. In other work (Petroni et al., 2019), pre-trained BERT models are used for predicting links in a KG. However, their focus was to evaluate the knowledge present in pre-trained BERT models rather than to improve an existing link prediction model. BERT embeddings have also been used for extracting entity type information; however, that work targets Entity Linking, whereas we target OpenKG link prediction.
Figure 1: OKGIT Architecture. OKGIT learns embeddings for Noun Phrases (NPs) and Relation Phrases (RPs) present in an OpenKG by augmenting a standard tail prediction loss with a type compatibility loss. Guidance for the tail type is obtained through a type projection of BERT's tail embedding prediction. In the figure, h, r, and t are the head NP, relation (RP), and tail NP. h = (w^h_1, ..., w^h_{k_h}) and r = (w^r_1, ..., w^r_{k_r}) are the tokens in the head NP and relation, respectively. t_C and t_B are the tail NP vectors predicted by the CaRE and BERT models (see Section 3 for background on these two models). Vectors τ_B and τ are the type vectors obtained using type projections P_B and P, respectively. ψ_PRED represents the tail prediction score (Section 4.1), while ψ_TYPE represents the type compatibility score (Section 4.2). ψ_OKGIT is the combined score generated by OKGIT for the input triple (h, r, t) (Section 4.3). Please refer to Section 4 for more details.

Background
We first introduce the notation used in this paper, followed by brief descriptions of BERT and CaRE. Notation: An Open Knowledge Graph OpenKG = (N , R, T ) contains a set of noun phrases (NPs) N , a set of relation phrases (RPs) R and a set of triples (h, r, t) ∈ T where h, t ∈ N and r ∈ R.
Here, h and t are called the head and tail NPs, and r is the RP between them. Each of them contains tokens from a vocabulary V; specifically, h = (w^h_1, w^h_2, ..., w^h_{k_h}), t = (w^t_1, w^t_2, ..., w^t_{k_t}), and r = (w^r_1, w^r_2, ..., w^r_{k_r}). Here, k_h, k_r, and k_t are the numbers of tokens in the head NP, the relation, and the tail NP. OpenKG embedding methods learn vector representations for NPs and RPs. Specifically, vectors for an NP e ∈ N and an RP r ∈ R are represented by boldface letters e ∈ R^{d_e} and r ∈ R^{d_r}. Here, d_e and d_r are the dimensions of NP and RP vectors; usually, d_e = d_r. A score function ψ(h, r, t) represents the plausibility of a triple. Similarly, BERT represents tokens by d_B-dimensional vectors. A type projection matrix P takes the vectors to a common d_τ-dimensional type space R^{d_τ}. The vectors in the type space are denoted by τ.
BERT (Devlin et al., 2019): BERT is a bidirectional language representation model based on the transformer architecture (Vaswani et al., 2017), which has shown performance improvements across multiple NLP tasks. It is pre-trained on two tasks: (1) Masked Language Modeling (MLM), where the model is trained to predict randomly masked tokens from the input sentences, and (2) Next Sentence Prediction (NSP), where the model is trained to predict whether an input pair of sentences occurs in sequence or not. In our case, we use a pre-trained BERT model (without fine-tuning) for predicting a masked tail NP in a triple.
CaRE (Gupta et al., 2019): CaRE is an OpenKG embedding method that can incorporate NP canonicalization information while learning the embeddings. NP canonicalization is the problem of grouping all surface forms of a given entity in one cluster, e.g., inferring that Barack Obama, Barack H. Obama, and President Obama all refer to the same underlying entity.
CaRE consists of three components: (1) a canonicalization cluster encoder (CN), which generates NP embeddings by aggregating embeddings of canonical NPs from the corresponding cluster, (2) a bi-directional GRU based phrase encoder (PN), which encodes the tokens in RPs to generate RP embeddings, and (3) a base model, which is an Ontological KG embedding method like ConvE (Dettmers et al., 2018). It uses NP and RP embeddings for scoring triples. These triple scores are then fed to a loss function (e.g., pairwise ranking loss with negative sampling (Bordes et al., 2013) or binary cross-entropy loss (BCE) (Dettmers et al., 2018)). In this paper, we use CaRE with ConvE as the base model. This model generates a candidate tail NP vector for a given NP h and RP r, denoted by CaRE(h, r).
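As a rough illustration of the cluster encoder (CN) idea, an NP's embedding can be built by aggregating the embeddings of the NPs in its canonicalization cluster. The mean aggregator and the toy 2-dimensional vectors below are assumptions for illustration only; CaRE's actual encoder and embeddings are learned:

```python
import numpy as np

def encode_np(np_vecs, cluster):
    """Sketch of a canonicalization cluster encoder: the embedding of
    an NP aggregates (here, averages) the embeddings of all NPs in its
    canonicalization cluster."""
    return np.mean([np_vecs[m] for m in cluster], axis=0)

# Toy vocabulary: three surface forms of the same entity.
np_vecs = {
    "barack obama":    np.array([1.0, 0.0]),
    "barack h. obama": np.array([0.8, 0.2]),
    "president obama": np.array([0.6, 0.4]),
}
cluster = ["barack obama", "barack h. obama", "president obama"]
np_embedding = encode_np(np_vecs, cluster)  # shared across the cluster
```

Every surface form in the cluster then shares this aggregated representation, which is one way canonicalization information can enter the embeddings.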

OKGIT: Our Proposed Method
Motivation: As illustrated in Table 1, the top NPs predicted by CaRE may not always be type compatible with the input query. On the other hand, BERT's top predictions are usually type compatible, although they may not be factually correct. Thus, we hypothesize that a combination of these two models can produce correct as well as type compatible predictions. Motivated by this, we develop OKGIT, which combines the best of both models. The complete architecture of the proposed model is shown in Figure 1. In the following sections, we present the various components of the proposed model.

ψ_PRED: Tail Prediction Score
The correctness of tail prediction in a triple is measured by the triple score function ψ_PRED. Given a triple (h, r, t), it uses the corresponding vectors (h, r, t) and assigns high scores to correct triples and low scores to incorrect triples. We follow CaRE (Gupta et al., 2019) for scoring triples, which internally uses ConvE (Dettmers et al., 2018) as the base model. For a given triple (h, r, t), the CaRE model first predicts a tail NP vector t_C as

t_C = CaRE(h, r).    (1)

The predicted tail NP vector t_C is then matched against the given tail NP vector t using the dot product to generate the triple score:

ψ_PRED(h, r, t) = t_C^⊤ t.    (2)
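A minimal sketch of this scoring step, with toy vectors standing in for the learned embeddings (in the actual model, t_C comes from CaRE's ConvE-based decoder):

```python
import numpy as np

def psi_pred(t_c, t):
    # Tail prediction score: dot product between the tail vector t_C
    # predicted for (h, r) and a candidate tail NP embedding t.
    return float(np.dot(t_c, t))

t_c = np.array([0.5, 1.0, -0.5])        # hypothetical CaRE(h, r) output
t_correct = np.array([0.4, 0.9, -0.4])  # embedding of the true tail NP
t_wrong = np.array([-0.5, -1.0, 0.5])   # embedding of an unrelated NP
# The correct tail should receive a higher score than the wrong one.
```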
The score ψ_PRED represents tail prediction correctness, and the CaRE model uses only this score.

ψ_TYPE: Tail Type Compatibility Score
The type compatibility between a given (head NP, RP) pair and a tail NP is measured by the type compatibility score function ψ_TYPE. It assigns a high score when an NP t has suitable types as a candidate tail NP for the given head NP h and RP r. We employ a Masked Language Model (MLM), specifically BERT (Devlin et al., 2019), for measuring type compatibility. Following (Petroni et al., 2019), we can generate a candidate tail NP vector using BERT. Specifically, given a triple (h, r, t), we replace the head NP h and RP r with their tokens and the tail NP t with a special MASK token. The resulting sequence (w^h_1, ..., w^h_{k_h}, w^r_1, ..., w^r_{k_r}, MASK) is sent as input to the BERT model. We denote the output vector from BERT corresponding to the MASK tail token as t_B.
We can predict tail NPs for a given (h, r) by finding the nearest neighbors of t_B from the BERT vocabulary (Appendix D). These predicted NPs may not be the correct tail NP present in the KG; however, they tend to be type compatible with the given (h, r) pair.
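As an illustration of this nearest-neighbor lookup, the sketch below ranks a toy vocabulary by dot-product similarity to a hypothetical MASK-position vector (the vocabulary and vectors are assumptions; real BERT applies its MLM output head over the full WordPiece vocabulary):

```python
import numpy as np

def nearest_tokens(t_b, vocab_emb, vocab, k=2):
    """Rank vocabulary tokens by similarity to the MASK-position
    output vector t_B and return the top-k candidates."""
    scores = vocab_emb @ t_b          # one similarity score per token
    order = np.argsort(-scores)       # highest similarity first
    return [vocab[i] for i in order[:k]]

vocab = ["paris", "london", "apple"]
vocab_emb = np.array([[1.0, 0.0],
                      [0.9, 0.1],
                      [0.0, 1.0]])
t_b = np.array([1.0, 0.05])  # pretend MASK output for a location query
top = nearest_tokens(t_b, vocab_emb, vocab)
```

Note how the top candidates share a type (locations) even if neither is the factually correct tail, which mirrors the observation motivating ψ_TYPE.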
Motivated by this, we extract the implicit NP type information from this vector using a type projector P_B ∈ R^{d_τ × d_B}. The output vector t_B from BERT is high-dimensional and can be used as a proxy for the NP's type. Therefore, P_B projects the t_B vector to a lower-dimensional space such that only relevant information is retained. We perform a similar operation on the tail NP embedding t, using a type projector P ∈ R^{d_τ × d_e} to extract its type information. Both P_B and P are trained jointly with the model. Thus, the type vectors are given by

τ_B = P_B t_B,    (3)
τ = P t,    (4)

for BERT and CaRE, respectively. Here, τ_B, τ ∈ R^{d_τ}. The type compatibility score between them is then measured by the negative Euclidean distance, i.e.,

ψ_TYPE(τ, τ_B) = -||τ - τ_B||_2.

We also experimented with a dot product version of the type score, ψ^Dot_TYPE(τ, τ_B) = τ_B^⊤ τ, and found its performance comparable to the Euclidean distance version. Therefore, we use the Euclidean distance version for all our experiments.
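The projection and scoring steps can be sketched with random toy matrices standing in for the learned projectors P_B and P (the dimensions below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_b, d_e, d_tau = 8, 6, 3                 # toy dimensions
P_B = rng.standard_normal((d_tau, d_b))   # projects BERT's t_B
P = rng.standard_normal((d_tau, d_e))     # projects the NP embedding t
t_b = rng.standard_normal(d_b)            # stand-in for BERT's MASK output
t = rng.standard_normal(d_e)              # stand-in for a tail NP embedding

tau_b = P_B @ t_b   # type vector extracted from BERT
tau = P @ t         # type vector extracted from the NP embedding

def psi_type(tau, tau_b):
    # Type compatibility: negative Euclidean distance in type space.
    return -float(np.linalg.norm(tau - tau_b))

score = psi_type(tau, tau_b)
```

In training, both projectors receive gradients through this score, so the type space is shaped jointly with the embeddings.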

ψ_OKGIT: Final Composite Score
The score functions ψ_PRED and ψ_TYPE may contain complementary information. Therefore, we use a combination of the triple and type compatibility scores as the final score for a given triple:

ψ_OKGIT(h, r, t) = ψ_PRED(h, r, t) + γ · ψ_TYPE(τ, τ_B).    (5)

Please recall that t_C and τ_B are in turn dependent on h and r (Equations (1) and (3)), while τ is dependent on t (Equation (4)). Here, γ controls the relative weights given to the individual scores. The final score thus accounts for both triple correctness and type compatibility. For training, we feed the sigmoid of this score to the Binary Cross Entropy (BCE) loss function, following (Dettmers et al., 2018).
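A sketch of the score combination, using arbitrary example score values (in the actual model these come from the two scoring functions above):

```python
import math

def psi_okgit(pred_score, type_score, gamma):
    # Combined score: tail-prediction correctness plus gamma-weighted
    # type compatibility. gamma = 0 recovers the plain CaRE score.
    return pred_score + gamma * type_score

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = psi_okgit(pred_score=2.0, type_score=-0.5, gamma=0.5)
p = sigmoid(s)  # probability fed to the BCE loss during training
```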

Learning with Type Regularization
Let X = {(h_1, r_1), ..., (h_n, r_n)} be the set of all head NPs and RPs which appear in the OpenKG. Let y_i be the label for the triple (h_i, r_i, t_i), which is 1 if (h_i, r_i, t_i) ∈ T and 0 otherwise. We apply the logistic sigmoid function σ on the score ψ_OKGIT to get the predicted label

ŷ_i = σ(ψ_OKGIT(h_i, r_i, t_i)).

Finally, we use the following binary cross-entropy (BCE) loss for triple correctness:

TripleLoss = -(1/n) Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ].
To further reinforce type compatibility in the model, we include an additional loss term which forces the type vectors of correct triples to be closer in the type space. Similar to TripleLoss, we use the binary cross-entropy loss for type regularization as well:

TypeLoss = -(1/n) Σ_i [ y_i log(σ(ψ_TYPE(τ_i, τ_{B,i}))) + (1 - y_i) log(1 - σ(ψ_TYPE(τ_i, τ_{B,i}))) ].
The cumulative loss function is then given as

Loss = TripleLoss + λ · TypeLoss,    (6)

where λ is the weight of the type regularization term and n is the number of training instances. We consider X × N as our training data, where triples present in T have label 1 and the rest have label 0.
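The overall training objective can be sketched as follows, with toy labels and scores (assumptions for illustration; in the actual model the scores come from ψ_OKGIT and ψ_TYPE, and gradients flow back into the embeddings and projectors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(labels, probs, eps=1e-9):
    # Standard binary cross-entropy averaged over instances.
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def total_loss(labels, okgit_scores, type_scores, lam):
    # Cumulative loss: triple-correctness BCE plus a lambda-weighted
    # BCE on the type compatibility score alone (type regularization).
    triple = bce(labels, [sigmoid(s) for s in okgit_scores])
    type_reg = bce(labels, [sigmoid(s) for s in type_scores])
    return triple + lam * type_reg

labels = [1.0, 0.0]  # one positive and one negative training triple
loss = total_loss(labels, [2.0, -1.0], [-0.2, -3.0], lam=0.1)
```

Setting lam=0 disables type regularization, which corresponds to the λ = 0 ablation discussed in the experiments.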

Experiments
Datasets: Following (Gupta et al., 2019), we use two subsets of English OpenKGs created using ReVerb (Fader et al., 2011), namely ReVerb20K and ReVerb45K, and we follow the same train-validation-test split. As noted in (Petroni et al., 2019), predicting multi-token NPs using BERT can be challenging and may require special pre-training (Joshi et al., 2020). To understand this difference, we create filtered subsets of these datasets that contain only single-token NPs. Specifically, we create ReVerb20KF (ReVerb20K-Filtered) and ReVerb45KF (ReVerb45K-Filtered), which contain only single-token NPs. More details about these datasets can be found in Table 2.
Setup and hyperparameters: We use d_e = d_r = 300 for NP and RP vectors. For other hyperparameters, we use grid-search and select the model based on MRR on the validation split. For type vectors, we select d_τ from {100, 300, 500}. The weight of the type regularization term λ is selected from {10^-3, 10^-2, ..., 10^1} ∪ {0}. The type composition weight γ is selected from {0.25, 0.5, 1.0, 2.0, 5.0}. For the language model, we try both BERT-base and BERT-large. The optimal hyperparameter values are shown in Table 3. The experiments run for 1.5 hours (filtered subsets) and 9 hours (full datasets) on a GeForce GTX 1080 Ti GPU.

Results
We evaluate the proposed model on the link prediction task, following the same evaluation process as in (Gupta et al., 2019). From our experiments, we try to answer the following questions:
1. Is OKGIT effective in the link prediction task? (Section 6.1)
2. Does OKGIT produce more type compatible NPs in its predictions? (Section 6.2)
3. Is the Type Projector effective in extracting type vectors from embeddings? (Section 6.3)

Effectiveness of OKGIT Embeddings in Link Prediction
We evaluate our model on the link prediction task. Given a held-out triple (h_i, r_i, t_i), all the NPs e ∈ N in the KG are ranked as candidate tail NPs based on their score ψ_OKGIT(h_i, r_i, e). Let the rank of the correct tail NP t_i be denoted by rank^t_i. Similarly, ranks are also calculated for predicting head NPs instead of tail NPs using inverse relations (Dettmers et al., 2018; Gupta et al., 2019); let these be denoted by rank^h_i. These ranks are then used to compute Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hits@k (k = 1, 3, 10) as follows:

MRR = (1 / 2n_test) Σ_i (1/rank^h_i + 1/rank^t_i),
MR = (1 / 2n_test) Σ_i (rank^h_i + rank^t_i),
Hits@k = (1 / 2n_test) Σ_i (1[rank^h_i ≤ k] + 1[rank^t_i ≤ k]).
Here, n_test is the number of test triples and 1[·] is the indicator function. As noted in (Gupta et al., 2019), ranking individual NPs is not suitable for OpenKGs due to the lack of canonicalization. Hence, following their approach, we rank gold canonicalization clusters instead of individual NPs. The gold canonicalization partitions the NPs into clusters such that NPs mentioning the same entity belong to the same cluster. For ranking these clusters, we first find the ranks of all NPs e ∈ N. Then, for each cluster, we keep the NP with the minimum rank as its representative and discard the others. The representative NPs are then ranked again, and the new ranks are assigned to the corresponding clusters. The rank of the cluster containing the true NP is then used for evaluating performance. For better readability, the MRR and Hits@k metrics have been multiplied by 100. We compare OKGIT with BERT (MLM), ConvE (Ontological KGE), and CaRE (OpenKGE). We also compare against a version of CaRE whose phrase embeddings have been initialized with BERT (CaRE [BERT initialization]). As the results in Table 4 show, the proposed model OKGIT outperforms the baseline methods in the link prediction task across all datasets. This suggests that the implicit type scores from BERT help improve the ranks of correct NPs. Moreover, OKGIT outperforms CaRE with BERT initialization, suggesting the importance of the type projectors.
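Given the (cluster-level) ranks of the correct answers, the metrics can be computed as follows (the ranks below are toy values; the ×100 scaling of MRR and Hits@k follows the convention used in the tables):

```python
def lp_metrics(ranks, ks=(1, 3, 10)):
    """Link prediction metrics from the ranks of the correct answers,
    with head and tail prediction ranks pooled into one list."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": 100.0 * sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = 100.0 * sum(1 for r in ranks if r <= k) / n
    return metrics

m = lp_metrics([1, 2, 10, 50])  # toy ranks for four test queries
```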
The performance gain is higher for ReVerb20K and ReVerb20KF (+5.3 MRR) than for ReVerb45K and ReVerb45KF (+1.2 and +3.1 MRR). As we can see from Table 2, the number of NPs is very close to the number of gold clusters in the 20K datasets. Thus, the canonicalization information is slightly weaker in the 20K datasets than in the 45K datasets. Due to this, CaRE achieved better gains on the ReVerb45K dataset, as noted in (Gupta et al., 2019). This leaves more scope for improvement in the 20K datasets. By including the type information from BERT, OKGIT is able to fill this gap: it achieves better gains in the 20K datasets and alleviates the lack of canonicalization information. Moreover, OKGIT improves the ranks of correct NPs that are ranked lower by CaRE, as seen from the significant improvements in MR.
Figure 2: Effect of type compatibility score and type regularization on link prediction performance. While the type compatibility score with λ = 0 gives better gains in MRR (11%-12%) than the type regularization term with γ = 0 (7%-11%), the combined model performs the best, achieving 12%-18% gains in MRR (Section 6.1).
Other Language Models: Using RoBERTa instead of BERT results in similar performance improvements (Appendix B). However, our primary focus is to understand the impact of the implicit type information present in pre-trained MLMs, such as BERT, and not to compare multiple MLMs themselves.
Ablations: We perform ablation experiments to compare the relative importance of the type compatibility score ψ_TYPE and the type regularization term. We evaluate OKGIT with the type compatibility score disabled (i.e., γ = 0 in Equation (5)) and with the type regularization term disabled (i.e., λ = 0 in Equation (6)) separately. Please note that the CaRE model is equivalent to OKGIT with γ = 0 and λ = 0. The results of this experiment are shown in Figure 2. We find that while the type compatibility score gives more performance gain (11%-12% in MRR) than type regularization (7%-11% in MRR), the combined model achieves the best performance (12%-18% gain in MRR). This suggests that both components are important. Please refer to Appendices A, B, and C for more ablation experiments.
Table 5: Results of type evaluation in CaRE and OKGIT predictions. We find that OKGIT performs better than CaRE on all datasets in terms of F1-score. The results are statistically significant for all datasets (Section 6.2).

Type Compatibility in Predicted NPs
As noted earlier, BERT vectors contain NP type information. OKGIT utilizes this type information for improving OpenKG link prediction. In this section, we evaluate whether OKGIT improves upon CaRE in predicting type compatible NPs. Such an evaluation requires type annotations for the NPs in the OpenKGs. However, OpenKGs have neither an underlying ontology nor explicit gold NP type annotations, making a direct evaluation impossible. Therefore, we employ a pre-trained entity typing model, UFET (Choi et al., 2018). Given a sentence and an entity mention, the entity typing model predicts the mentioned entity's types. Using this model, we obtain types for the true NPs as well as the NPs predicted by CaRE and OKGIT and use them for the evaluation. Please note that this evaluation is limited by the coverage and quality of the UFET model.
Evaluation Protocol: The type vocabulary in the UFET model contains 10,331 types, including 9 general, 121 fine-grained, and 10,201 ultra-fine types. The model takes a sentence (w^h_1, ..., w^h_{k_h}, w^r_1, ..., w^r_{k_r}, w^t_1, ..., w^t_{k_t}) formed from a triple (h, r, t) along with an entity mention (either t or h) as input and outputs a distribution over types. We use the top five predicted types for our experiments. For a triple (h, r, t), we consider the types predicted for the true tail NP t as the true types Γ(t). Let t̂_CaRE and t̂_OKGIT be the top tail NPs predicted by CaRE and OKGIT for the (h, r) pair. Then the types Γ(t̂_CaRE) predicted for t̂_CaRE in the triple (h, r, t̂_CaRE) are used as the predicted types for CaRE. Similarly, the types Γ(t̂_OKGIT) predicted for t̂_OKGIT in the triple (h, r, t̂_OKGIT) are used as the predicted types for OKGIT. For evaluation, we calculate the mean F1-score as follows:

P(t̂) = |Γ(t) ∩ Γ(t̂)| / |Γ(t̂)|,  R(t̂) = |Γ(t) ∩ Γ(t̂)| / |Γ(t)|,
F1 = (1/n_test) Σ 2 · P(t̂) · R(t̂) / (P(t̂) + R(t̂)).

Here, |Γ(t)| denotes the number of types present in Γ(t), and t̂ represents t̂_CaRE or t̂_OKGIT. We obtain the F1-scores for head NP prediction similarly. We evaluate the mean F1-scores across the head and tail NP prediction tasks on the test data and compare CaRE with OKGIT.
Figure 3: t-SNE projections of tail NP embeddings (left) and type vectors extracted by the Type Projector from tail NP embeddings (Section 4.2) (right) in the ReVerb20K dataset. We find that the Type Projector is able to extract informative type vectors from the tail embeddings: tail embeddings corresponding to persons, locations, and dates are inter-mixed in the left plot, while they are separated into type-specific clusters in the right plot. Please see Section 6.3 for details.
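The per-triple F1 between the true and predicted type sets can be sketched as follows (the toy type lists are assumptions; in the actual evaluation the types come from UFET):

```python
def type_f1(true_types, pred_types):
    """F1 between the type set of the gold NP and the type set of a
    model's top predicted NP."""
    overlap = len(set(true_types) & set(pred_types))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_types)
    recall = overlap / len(true_types)
    return 2 * precision * recall / (precision + recall)

gold = ["person", "politician", "leader"]   # types of the true tail NP
pred = ["person", "politician", "artist"]   # types of the predicted NP
f1 = type_f1(gold, pred)
```

Averaging this value over all test triples (and over head and tail prediction) gives the mean F1-score reported in Table 5.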
As we can see from the results in Table 5, OKGIT performs better than CaRE, suggesting that OKGIT generates more type compatible NPs than CaRE in the link prediction task. OKGIT achieves higher gains on the single-token datasets (i.e., ReVerb20KF and ReVerb45KF) than on the multi-token datasets (i.e., ReVerb20K and ReVerb45K). Upon investigation, we found that the types obtained using the entity typing model (true as well as predicted) for the multi-token datasets often contain common noisy types, leading to the small difference between CaRE and OKGIT. Following Dror et al. (2018), we also check the results for statistical significance using the Permutation, Wilcoxon, and t-tests with α = 0.05, and found them to be significant for all datasets.

Effectiveness of Type Projector
To better understand the effect of type projection, we visualize the vectors in the NP-space from CaRE and the Type-space (i.e., after type projection) from OKGIT. For this experiment, we randomly select 5 NPs from each of 3 categories, namely Person, Location, and Year. More details about this selection process can be found in Appendix E. We project the NP vectors (i.e., t) corresponding to these NPs to a 2-dimensional NP-space using t-SNE (Maaten and Hinton, 2008). Similarly, we also project the corresponding type vectors (i.e., τ) to a 2-dimensional Type-space. We plot the resulting vectors, color and shape coded by their respective categories, in Figure 3.
We can see that the vectors from different categories in the NP-space are mixed. However, after the type projection, the vectors in the Type-space are clustered together based on their categories.

Qualitative Evaluations
In this section, we present some examples of predictions made by the CaRE and OKGIT methods. The results are shown in Table 6. As we see in Triple-1, both CaRE and OKGIT predict the correct NP. We see similar patterns in Triple-2, where the correct tail NP should be of type number, indicating the count of votes. OKGIT is able to predict numbers in its top predictions for Triple-2, while CaRE has mixed types in its top predictions.

Conclusion
The task of link prediction for Open Knowledge Graphs (OpenKGs) has been a relatively under-explored research area. Previous work on OpenKG embeddings has primarily focused on improving or incorporating NP canonicalization information. While there are a few methods for OpenKG link prediction, they often predict noun phrases with types incompatible with the query noun and relation phrases. Therefore, we use implicit type information from BERT to improve OpenKG link prediction and propose OKGIT. With the help of a novel type compatibility score and a type regularization term, OKGIT achieves significant performance improvements on the link prediction task across multiple datasets. We also find that OKGIT produces more type compatible predictions than CaRE, as evaluated using an external entity typing model.

Broader Impact
OKGIT is the first attempt towards incorporating implicit type information in OpenKG link prediction without human intervention. It will greatly benefit densification and applications of OpenKGs where no underlying ontologies are available.
However, OKGIT predictions depend on various datasets, i.e., the corpus used for training the masked language model (e.g., BERT) and the corpus from which the OpenKG triples were extracted. A potential, possibly undesirable, bias may be introduced in the predictions by manipulating these corpora or adding a large number of malicious triples in the OpenKG.
We have tested OKGIT in English datasets. While the overall model architecture is independent of the language, the model's effectiveness might vary depending upon the quality of the masked language model, and it needs to be tested.

A BERT Initialization vs Type Projectors
Here, we demonstrate the importance of the type projectors by comparing OKGIT with multiple BERT-augmented versions of CaRE. Specifically, we initialize the phrase and word embeddings in CaRE with a pre-trained BERT model. The phrase (word) is passed as input to BERT, and the output corresponding to the [CLS] token is then used for initializing the phrase (word) embedding. In all the methods, including OKGIT, we never fine-tune BERT, as our goal is to evaluate the type information already present in the pre-trained BERT model. We experiment with both BERT-base and BERT-large and report the best performing model.
As we can see from the results in Table 7, OKGIT outperforms these baselines. Although BERT initialization improves the performance of CaRE model, the usage of explicit type-score and type regularization leads to significant performance improvements, suggesting their importance.

B Replacing BERT with other operations
In this section, we evaluate whether the BERT module in OKGIT can be replaced by simple operations such as vector addition and concatenation. Specifically, we modify t_B in Equation (3) by replacing BERT with these operations, leading to the following variants of OKGIT. (We also tried using pre-trained BERT as the RP encoder in CaRE; however, it performed poorly due to the fixed RP encoder.)
OKGIT-C: BERT is replaced by concatenation of the head NP vector h and the relation phrase vector r: t_B = [h; r].

OKGIT-A: BERT is replaced by vector addition of the head NP and relation phrase vectors: t_B = h + r.
OKGIT-R: We also experiment with another masked language model, RoBERTa, in place of BERT.
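The first two variants replace BERT's output with simple compositions of the learned vectors, which can be sketched with toy vectors as follows:

```python
import numpy as np

h = np.array([1.0, 2.0])   # toy head NP vector
r = np.array([3.0, 4.0])   # toy relation phrase vector

# OKGIT-C: t_B is the concatenation [h; r] instead of a BERT output.
t_b_concat = np.concatenate([h, r])

# OKGIT-A: t_B is the element-wise sum h + r.
t_b_add = h + r
```

Note that the type projector P_B must match the resulting dimension of t_B (e.g., 2·d_e for OKGIT-C, d_e for OKGIT-A).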
For this experiment, we use the ReVerb20KF and ReVerb45KF datasets as representatives. We perform grid-search with similar hyper-parameters as in Section 5 of the main paper and select the best model based on the MRR on the validation split. The results are reported in Table 8.
As we can see from the results, OKGIT-C and OKGIT-A perform very similarly to CaRE on both datasets. This suggests that the performance gains of OKGIT come from the BERT module. This observation is further reinforced by OKGIT-R, which yields improvements over CaRE similar to those of OKGIT. However, in all cases, we find that OKGIT with BERT outperforms the other model variants.

C CaRE with Entity Typing
Entity typing is the task of predicting the explicit types of an entity given a sentence and its mention. As we are interested in improving the type compatibility of predictions in the link prediction task, we can also incorporate the output of an entity typing model. In this section, we explore this setting by replacing the BERT module in OKGIT with the entity typing model UFET from (Choi et al., 2018). Specifically, we replace the vector t_B in Equation (3) with the output of UFET, representing the predicted probability distribution over types.
OKGIT(UFET) Model: The UFET model takes a sentence and an entity mention as input and produces a distribution over an explicit set of types. In our case, the sentence is formed by concatenating the subject NP, relation phrase, and object NP, while the object NP is used as the mention. The output distribution from UFET is used as t_B in our model. We call this version of the model OKGIT(UFET) and compare it with CaRE and OKGIT.
We run a grid-search for the best hyperparameters, similar to Section 5, and report the results.

D Predicting Tail NPs from the BERT Vocabulary

When BERT is used directly for tail prediction, it can only predict single-token NPs from its own vocabulary. This limitation, however, is not valid for OKGIT. In OKGIT, the vector t_B is used for computing the tail type compatibility score instead of predicting tail NPs. Therefore, it is not restricted to the BERT vocabulary or single-token NPs. As shown in Table 4, OKGIT is equally effective on single-token datasets (e.g., ReVerb20KF and ReVerb45KF) and multi-token datasets (e.g., ReVerb20K and ReVerb45K).

E Selection of NPs for t-SNE
The OpenKGs do not have type annotations for the NPs. Therefore, we manually annotated a set of NPs and visualized a random subset. For this process, we first list all the NPs and shuffle them. Then we scan this list and note the first fifteen person names, locations, and years. Later, we select five NPs from each of these categories randomly and use them for the evaluation.

F Link Prediction Performance on Validation Split
The performance of CaRE and OKGIT on the validation data for the link prediction task can be found in Table 10. These numbers correspond to the respective models used to report the results in Table 4 of the main paper.

G Type Information in BERT Predictions
Our proposed OKGIT model is based on the hypothesis that BERT vectors (i.e., t B in Equation (3) in Section 4.2) contain implicit type information.
In this section, we evaluate this hypothesis. It should be noted that evaluating the OKGIT model for predicting NP types is not the goal here; we are interested in understanding whether pre-trained BERT vectors carry sufficient type information, measured with respect to some existing anchors.
Evaluation Method: For this experiment, we use Freebase (Bollacker et al., 2008), which contains explicit gold type information for entities. Specifically, we use the FB15k dataset (Bordes et al., 2013). We use the data from (Yao et al., 2019) for converting symbolic names in FB15k to textual descriptions. We only consider the subset of triples in FB15k which have a single token in the tail node, as BERT can only predict single-token NPs. This results in n_T = 95,782 triples. For type information, we use the data from (Xie et al., 2016), which contains 61 primary types (e.g., /award). Please note that each node in FB15k can have multiple types. For a triple (h, r, t), we consider the types associated with the true tail NP t as the true types Γ(t). We then pass the tokenized head NP and RP to BERT and find the top prediction t̂ = BERT(h, r, MASK) for the tail position. The set of types associated with the predicted NP t̂, denoted by Γ(t̂), is then used as the predicted types. For evaluation, we calculate the following metrics:

Precision = (1/n_T) Σ |Γ(t) ∩ Γ(t̂)| / |Γ(t̂)|,
Recall = (1/n_T) Σ |Γ(t) ∩ Γ(t̂)| / |Γ(t)|.

Here, |Γ(t)| and |Γ(t̂)| denote the number of types present in Γ(t) and Γ(t̂), respectively. For comparison, we use the following baseline methods to assign types to a given (h, r, t).