Are Missing Links Predictable? An Inferential Benchmark for Knowledge Graph Completion

We present InferWiki, a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns. First, each testing sample is predictable with supportive data in the training set. To ensure this, we propose rule-guided train/test generation instead of the conventional random split. Second, InferWiki initiates evaluation following the open-world assumption and increases the inferential difficulty of the closed-world assumption by providing manually annotated negative and unknown triples. Third, we include various inference patterns (e.g., reasoning path lengths and types) for comprehensive evaluation. In experiments, we curate two settings of InferWiki varying in size and structure, and apply the construction process to CoDEx to obtain comparative datasets. The results and empirical analyses demonstrate the necessity and high quality of InferWiki. Nevertheless, the performance gap among various inferential assumptions and patterns shows the difficulty of the benchmark and suggests future research directions. Our datasets can be found at https://github.com/TaoMiner/inferwiki.


Introduction
Knowledge Graph Completion (KGC) aims to predict missing links in a KG by inferring new knowledge from existing knowledge. Owing to their reasoning ability, KGC models are crucial in alleviating the KG's incompleteness issue and benefit many downstream applications, such as recommendation (Cao et al., 2019b) and information extraction (Hu et al., 2021; Cao et al., 2020a). However, KGC performance on existing benchmarks is still unsatisfactory: the top-ranked model (Wang et al., 2019) achieves 0.51 Hit Ratio@1 and 187 Mean Rank on the widely used FB15k237. Is this due to slow progress of models (Akrami et al., 2020)? Or should we blame the low quality of benchmarks?
In this paper, we rethink the task of KGC and construct a new benchmark, dubbed InferWiki, that highlights three fundamental objectives.
Test triples should be inferential. This is the essential requirement of KGC: each test triple should have supportive samples in the train set. However, we observe two major issues in current KGC datasets, unpredictable and meaningless test triples, which may hinder evaluating and advancing the state of the art. As shown in Table 1, the first example of inferring the location for David (i.e., Florida) is impossible even for humans, not to mention machines, merely based on his birthplace and nationality (i.e., Atlanta and USA). In contrast, the second example is predictable but meaningless: it asks to find the missing month from a list of months within a year. Such cases are very common in existing datasets, e.g., YAGO3-10 (Dettmers et al., 2018) and CoDEx (Safavi and Koutra, 2020), mainly due to their construction process: first collecting a high-frequency subset of entities and then randomly splitting their triples into train/test. In this setting, KGC models may be over- or under-estimated, as we are unsure whether even a human can perform better.
Test triples may be inferred positive, negative, or unknown. According to the open-world assumption, what is not observed in the KG is not necessarily false, but unknown (Shi and Weninger, 2018). However, existing benchmarks generate unseen triples as negatives (i.e., the closed-world assumption), because a KG contains only positive triples. They usually randomly corrupt the head or tail entity in a triple, sometimes with type constraints. This leads to trivial evaluation (almost 100% accuracy in triple classification (Safavi and Koutra, 2020)). Besides, the lack of unknown test triples ignores a critical inference capacity and may cause false negative errors in knowledge-driven tasks (Kotnis and Nastase, 2017).
Inference has various patterns. Concentrating on limited patterns in evaluation may introduce severe bias. The domain-specific datasets Kinship and Country focus on only a few relations and are nearly solved (Das et al., 2017). The general-domain WN18RR contains prevalent symmetric relation types, which incorrectly boosts the performance of RotatE (Abboud et al., 2020). Clearly, limited patterns lead to unfair comparisons among KGC models.
To this end, we curate an Inferential KGC dataset extracted from Wikidata and establish the benchmark with two settings of varying size and structure: InferWiki64k and InferWiki16k. Instead of a random split, we mine rules via AnyBURL (Meilicke et al., 2019) to guide train/test generation. All test triples are thus guaranteed to be inferential from training data. To avoid rule leakage, we utilize two sets of triples: a large set for high-quality rule extraction and a small set for the train/test split. Moreover, we infer unseen triples and manually annotate them with positive, negative, and unknown labels to increase the difficulty of evaluation following both the closed-world and open-world assumptions. For inference patterns, we include and balance triples with different reasoning path lengths, relation types, and patterns (e.g., symmetry and composition).
Our contributions can be summarized as follows:
• We summarize three principles of KGC: inferential ability, assumptions, and patterns, and construct a rule-guided dataset.
• We highlight the importance of negatives and unknowns, and initiate open-world evaluation.
• We conduct extensive experiments to establish the benchmark. The results and deep analyses verify the necessity and challenge of InferWiki, providing insights for future research.

Related Work
We can roughly classify current KGC datasets into two groups: inferential and non-inferential datasets. Datasets in the first group are usually manually curated to ensure that each testing sample can be inferred from training data through reasoning paths, but they focus only on specific relations, such as Families, Kinship (Kemp et al., 2006), and Country. Their limited scale and inference patterns make them unchallenging: HOLE (Nickel et al., 2016) achieves 99.7% AUC-PR on the Country dataset. Datasets in the second group are automatically derived from public KGs by randomly splitting positive triples into train/test, which risks producing testing samples that are non-inferential from training data. Popular datasets include FB15k-237, WN18RR, and YAGO3-10. CoDEx (Safavi and Koutra, 2020) questions the scope and difficulty of the above datasets and thus proposes a comprehensive dataset with manually verified hard negatives.
In fact, inference is an important ability for intelligence. Various fields, ranging from logic to cognitive psychology, study how inference is done in practice. Inference helps people make reliable predictions, which is also an expected ability of AI models. Indeed, once deployed, a model may have to make a prediction when there is no evidence in the training set. Instead of an unreliable guess, we highlight the ability to know what is unknown, a.k.a. the open-world assumption. Therefore, we aim to curate a large-scale inferential benchmark, InferWiki, that includes various inference patterns and testing samples (i.e., positive, negative, and unknown) for better evaluation. We list the statistics in Table 2.

Dataset Design
We describe our dataset construction, which comprises four steps: data preprocessing, rule mining, rule-guided train/test generation, and inferred test labeling. We then give a detailed analysis.

Data Preprocessing
More and more studies utilize Wikidata as a knowledge resource due to its high quality and large quantity. We utilize the September 2019 English dump in experiments. Data preprocessing aims to define the relation vocabulary and extract two sets of triples from Wikidata: a large one for rule mining, $T_r$, and a relatively small one for dataset generation, $T_d$. The reason for using two sets is to avoid the leakage of rules. In other words, some rules that are frequent on the large set may have very few groundings on the small set. The different distributions prevent rule mining methods from easily achieving high performance. Besides, more triples can improve the quality of mined rules. In contrast, the relatively small set is enough for efficient KGC training and evaluation.
Specifically, we first extract all triples that consist of two entity items and one relation with English labels. We then remove repeated triples and obtain 40,199,175 triples with 7,734,841 entities and 1,170 different relation types. Considering rule mining efficiency, we reduce the relation vocabulary by (1) manually filtering out meaningless relations, such as movie ID or film rating, (2) removing the relations InstanceOf and subClassOf following existing benchmarks, and (3) selecting the 500 most frequent relation types. We focus on the 800,000 most frequent entities, which results in 8,632,777 triples as the large set for rule mining. To obtain the small set for dataset construction, we further select the 120,000 most frequent entities and 300 most frequent relations, which results in 1,283,246 triples. Note that we also infer new triples and label them as positive, negative, or unknown later.
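The frequency-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `select_frequent_subset` is hypothetical, entity frequency is assumed to be the number of deduplicated triples an entity appears in (as head or tail), and a triple is assumed to be kept only if both of its entities and its relation are in the selected vocabularies.

```python
from collections import Counter


def select_frequent_subset(triples, num_entities, num_relations):
    """Keep triples built only from the most frequent entities and relations.

    `triples` is an iterable of (head, relation, tail) tuples.
    Assumption: frequency is counted over deduplicated triples.
    """
    # dict.fromkeys removes repeated triples while preserving their order.
    unique_triples = list(dict.fromkeys(triples))

    entity_counts = Counter()
    relation_counts = Counter()
    for head, relation, tail in unique_triples:
        entity_counts[head] += 1
        entity_counts[tail] += 1
        relation_counts[relation] += 1

    top_entities = {e for e, _ in entity_counts.most_common(num_entities)}
    top_relations = {r for r, _ in relation_counts.most_common(num_relations)}

    # Assumption: both endpoints and the relation must be in the vocabulary.
    return [
        (h, r, t)
        for h, r, t in unique_triples
        if h in top_entities and t in top_entities and r in top_relations
    ]
```

Under these assumptions, calling the function with 800,000 entities and 500 relations would correspond to the large set, and calling it again with 120,000 entities and 300 relations to the small set.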

Rule Mining
Developing advanced rule mining models is not the focus of this paper, and several mature tools are available online, such as AMIE+ (Galárraga et al., 2015) and AnyBURL (Meilicke et al., 2019). We utilize AnyBURL in experiments due to its efficiency and effectiveness.
Given a set of triples (i.e., the large set $T_r$), this step aims to automatically learn rules $F = \{(f_p, \lambda_p)\}_{p=1}^{P}$, where $f_p$ denotes a Horn rule, e.g., spouse(x, y) ∧ father(x, z) ⇒ mother(y, z), and $\lambda_p \in [0, 1]$ denotes the confidence of $f_p$. For each rule $f_p$, the left side of ⇒ is called the premise, and the right side is called the conclusion, where the conclusion contains a single atom and the premise is a conjunction of several atoms in the Horn rule scheme. We can ground specific entities to replace x, y, z in $f_p$, which denotes an inferential relationship between premise and conclusion triples. For example, given spouse(LeBron James, Savannah Brinson) and father(LeBron James, Bronny James), we may infer a new triple mother(Savannah Brinson, Bronny James).
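To make the notion of grounding concrete, the following sketch enumerates all groundings of a Horn rule over a small set of triples. It is an illustrative backtracking matcher, not AnyBURL's algorithm; the function name `ground_rule` and the tuple encoding of atoms as (relation, variable, variable) are assumptions introduced here, and every conclusion variable is assumed to occur in the premise.

```python
def ground_rule(premise, conclusion, triples):
    """Yield (premise_triples, conclusion_triple) for each grounding of a rule.

    premise:    list of atoms (relation, var_a, var_b)
    conclusion: a single atom (relation, var_a, var_b)
    triples:    iterable of (head, relation, tail) tuples
    """
    by_relation = {}
    for head, relation, tail in triples:
        by_relation.setdefault(relation, []).append((head, tail))

    def extend(index, binding, used):
        if index == len(premise):
            relation, x, y = conclusion
            yield used, (binding[x], relation, binding[y])
            return
        relation, x, y = premise[index]
        for head, tail in by_relation.get(relation, []):
            # An atom with a repeated variable needs head == tail.
            if x == y and head != tail:
                continue
            # The triple must agree with variables that are already bound.
            if binding.get(x, head) != head or binding.get(y, tail) != tail:
                continue
            yield from extend(
                index + 1,
                {**binding, x: head, y: tail},
                used + [(head, relation, tail)],
            )

    yield from extend(0, {}, [])


# The example rule from the text: spouse(x, y) AND father(x, z) => mother(y, z)
example_triples = [
    ("LeBron James", "spouse", "Savannah Brinson"),
    ("LeBron James", "father", "Bronny James"),
]
example_groundings = list(
    ground_rule(
        [("spouse", "x", "y"), ("father", "x", "z")],
        ("mother", "y", "z"),
        example_triples,
    )
)
```

Here `example_groundings` holds one grounding whose conclusion is the triple mother(Savannah Brinson, Bronny James), written in (head, relation, tail) order.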
Of course, not all of the mined rules are reasonable. To alleviate the negative impact of unreasonable rules, we rely on more data (a large set of triples) and keep only high-confidence rules. In particular, we follow the suggested configuration of AnyBURL. We run it for 500 seconds to ensure that all triples can be traversed at least once and obtain 251,317 rules, of which the 168,996 rules with confidence $\lambda_p > 0.1$ are selected as the rule set to guide dataset construction.

Rule-guided Dataset Construction
Different from existing benchmarks, InferWiki provides inferential testing triples with supportive data in the training set. Moreover, it aims to include as many inference patterns as possible, and these patterns should be evenly distributed to avoid biased evaluation. Thus, this step has four objectives: rule-guided split, path extension, negative supplement, and inference pattern balance.
Rule-guided Split grounds the mined rules $F$ on triples $T_d$ to obtain premise triples and corresponding conclusion triples. All premise triples form a training set, and all conclusion triples form a test set. Thus, they are naturally guaranteed to be inferential. For correctness, all premise triples must exist in the given triple set $T_d$, while conclusion triples are not necessarily in $T_d$ and may be generated for further annotation (i.e., Section 3.4).
For example, given a rule spouse(x, y) ∧ father(x, z) ⇒ mother(y, z), we traverse all of the given triples and find entities LeBron James, Savannah Brinson, and Bronny James that meet the premise. We then add the premise triples spouse(LeBron James, Savannah Brinson) and father(LeBron James, Bronny James) to the training set, and generate the conclusion triple mother(Savannah Brinson, Bronny James) for testing, whether or not it is given.
Path Extension aims to increase the inference path patterns by (1) adding more reasoning paths for the same testing triple, and (2) elongating paths by replacing those premise triples that themselves have reasoning paths. For example, we replace father(LeBron James, Bronny James) with two triples that can infer it: father(LeBron James, Bryce James) and brother(Bronny James, Bryce James). The original path is then extended by one hop. Correspondingly, we define the confidence of an extended path as the product of the confidences of all involved rules. Longer paths challenge long-distance reasoning ability.
Negative Supplement generates negative triples if we cannot annotate as many negatives as positive triples. Otherwise, we would face an imbalance issue. Following conventions, we randomly corrupt the head or tail entity in a positive triple with the following constraints: (1) the relation of the positive triple is exclusive, e.g., placeOfBirth, which we decide by checking whether the ratio from head to tail entities is smaller than a threshold (we choose 1.2 heuristically in experiments); otherwise, the corrupted negative triple may actually be positive, leading to false negative errors. (2) We choose positive triples from the test set for corruption to increase the difficulty: the model has to correctly infer the corresponding positive triple from training data, and then classify the corrupted triple as negative through the conflict.
In particular, for non-exclusive relation types, most of their corrupted results should be unknown following the open-world assumption. The inferred test set covers such cases, which will be discussed in Section 3.4.
Inference Pattern Balance aims to balance various inference patterns, including path length, relation types, and relation patterns (i.e., symmetry, inversion, hierarchy, composition, and others). This is because concentrating on some patterns may lead to severe bias and unfair comparison between KGC models. We first count the frequency of testing triples according to path lengths, relation types, and patterns, respectively. For each of them, we rank the groups by count and choose the highest-ranked groups of triples as frequent ones, instead of setting a threshold. We then randomly remove some frequent triples until the new distributions reach an acceptable range (checked by humans).
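Two of the steps above lend themselves to a short sketch: the confidence of an extended path as a product of rule confidences, and the exclusivity check that gates negative supplementation. The code below is a hedged illustration, not the authors' implementation. In particular, the text does not fully specify the "ratio from head to tail entities", so the sketch assumes it means the average number of distinct tails per head for a relation; the function names are hypothetical, and only tail corruption is shown.

```python
import math
import random
from collections import defaultdict


def path_confidence(rule_confidences):
    """Confidence of an extended path: product of the involved rule confidences."""
    return math.prod(rule_confidences)


def exclusive_relations(triples, threshold=1.2):
    """Relations whose heads map to (almost) a single tail.

    Assumption: the ratio is the average number of distinct tails per head.
    """
    tails = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        tails[relation][head].add(tail)
    exclusive = set()
    for relation, per_head in tails.items():
        average = sum(len(t) for t in per_head.values()) / len(per_head)
        if average < threshold:
            exclusive.add(relation)
    return exclusive


def corrupt_tail(triple, entities, known_triples, rng):
    """Replace the tail with a random entity, avoiding known positives."""
    head, relation, tail = triple
    candidates = [
        e for e in entities
        if e != tail and (head, relation, e) not in known_triples
    ]
    if not candidates:
        return None
    return (head, relation, rng.choice(candidates))
```

A caller would first compute `exclusive_relations` on the available triples and then apply `corrupt_tail` only to test positives whose relation is in that set, passing a seeded `random.Random` instance for reproducibility.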

Inferred Test Triple Labeling
Different from existing datasets, InferWiki aims to include positive, negative, and unknown testing triples, to evaluate models under two types of assumptions: the open-world assumption and the closed-world assumption. The main difference between them is whether unknown triples are regarded as negatives. That is, the open-world evaluation is a three-class classification problem (i.e., positive, negative, and unknown). The closed-world evaluation targets only positive and negative triples, and we can simply relabel unknown triples as negatives without changing the test set.
So far, we have two test sets: one is generated via rule guidance, and the other contains the supplemented negatives. This section aims to label the generated triples. First, we automatically label triples as positive if they exist in Wikidata. Then, we manually annotate the remaining 4,053 triples. The annotation guideline can be found in Appendix B. Note that all of the unknowns are factually incorrect but not inferential. To assess the quality of annotations, we verify a random selection of 300 test triples (100 for each label). The annotators agree with our labels 84.3% of the time. We further investigate the disagreements by relabeling 100 samples. 85% of the time, humans prefer unknown, while automatic labeling tends to assign positive or negative labels. This suggests an inferential difference between humans and machines: the capacity of knowing what is unknown.
Finally, we remove the entities that are not in any of the grounded paths, together with their triples. We randomly select half of the test set as the validation set. This forms InferWiki64k. We further extract a dense subset, InferWiki16k, by filtering out the positive triples whose confidence is smaller than 0.6. Correspondingly, negative/unknown triples are reduced to keep the balance. The statistics are listed in Table 2.

Dataset Analysis
Table 3 shows positive, negative, and unknown examples of InferWiki and their (possible) supportive training data. For positives, their paths seem reasonable and vary in length, relation types, and patterns. The 7-hop path of the sibling example is difficult even for a human. Negatives and unknowns are indeed incorrect and more challenging. There are no directly contradicting triples in the train set: the model is encouraged to reason over related triples and judge whether there is a conflict (i.e., negative) or not (i.e., unknown). Nevertheless, there are two minor issues. First, some unreasonable paths may corrupt the predictability. We thus raise the rule confidence threshold to λ > 0.6 for InferWiki16k and manually annotate uncertain test triples to ensure the correctness of labels. More advanced rule mining models can improve the construction pipeline; we leave this to future work. Second, do unknown triples have a bias toward certain relation types? The answer is yes but not exactly. As shown in Table 3, the relation connectsWith is involved in both positive and unknown triples, and the label is also determined by the paths.
Next, we analyze the relation patterns and path length distribution through comparisons with existing KGC datasets. Due to the different construction pipelines, it is difficult to obtain quantitative statistics for existing datasets. We thus apply our pipeline to CoDEx (Safavi and Koutra, 2020). The resulting dataset, CoDEx-m-infer, keeps only inferential test triples and leaves the training set unchanged, which reduces the test and valid positives from 20,622 to 7,050. This agrees with the original paper, which reports that 20.56% of triples are symmetric or compositional through AMIE+ analysis. We find more paths because more extensive rules are extracted from a large set of triples. This also demonstrates the necessity of rule-guided train/test generation: most test triples are not guaranteed to be inferential when using a random split.
Relation Pattern. Following convention, we count reasoning paths for various patterns: symmetry, inversion, hierarchy, composition, and others, whose detailed explanations and examples can be found in Appendix C. If a triple has multiple paths, we count all of them. As Figure 1 shows, (1) there are no inversion patterns and only a few symmetry and hierarchy patterns in CoDEx-m, as most current datasets remove them to avoid train/test leakage. However, we argue that learning and remembering such patterns is also an essential capacity of inference; one only needs to control their numbers for a fair comparison. (2) The patterns of InferWiki are more evenly distributed. Note that the patterns of symmetry, inversion, and hierarchy refer to 1-hop paths, while composition and others refer to multi-hop paths. The total number of the former three is therefore almost the same as that of the latter two, to balance paths of varying lengths, which will be discussed next.
Path Length Distribution. The reasoning paths ensure the predictability of test triples but may not be the shortest ones, as there may be undiscovered paths connecting two entities. Thus, our statistics concerning path length offer a conservative analysis and give an upper bound. For a test triple with multiple paths, we count the shortest one. As shown in Figure 2, InferWiki has more long-distance paths, while CoDEx-m-infer mostly concentrates on reasoning paths of at most 3 hops. Specifically, the maximum path length of InferWiki is 9 (4 before path extension) and the average length is 2.9 (1.5 before path extension).
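The path length statistics described above can be computed with a few lines. This is a minimal sketch with hypothetical names; it assumes each test triple maps to a list of reasoning paths, each path being a list of training triples (one per hop), and that the average is taken over the shortest path of each triple, which the text does not state explicitly.

```python
from collections import Counter


def shortest_path_lengths(paths_by_triple):
    """Map each test triple to the hop count of its shortest known path."""
    return {
        triple: min(len(path) for path in paths)
        for triple, paths in paths_by_triple.items()
    }


def length_distribution(paths_by_triple):
    """Return (histogram of shortest lengths, maximum, average)."""
    lengths = list(shortest_path_lengths(paths_by_triple).values())
    histogram = Counter(lengths)
    return histogram, max(lengths), sum(lengths) / len(lengths)
```

Because undiscovered shorter paths may exist, the lengths returned here are upper bounds on the true shortest path lengths, matching the conservative reading in the text.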
Further analysis of relation, entity and neighbor distributions can be found in Appendix D&E.

Limitation
Although we carefully design the construction of InferWiki, there are still two types of limitations, rule biases and dataset errors, that can be addressed along with the development of KG techniques in the future. In terms of rule biases, AnyBURL may be over-estimated due to its role in the construction. Although we utilize two triple sets to avoid rule leakage, their overlap may still bring an unfair performance gain to AnyBURL. We consider synthesizing several rule mining results to improve InferWiki in the next version. In terms of dataset errors, first, to balance positive and negative triples in the larger InferWiki64k, we follow conventions and randomly sample a portion of negatives. These negatives may be unknown under the open-world assumption. We manually assess the randomly sampled negatives and find a 15.7% error rate. Therefore, we conduct open-world experiments on the smaller InferWiki16k, all of whose testing negatives are verified by humans. The second type of error is due to unreasonable rules used for the dataset split, which are caused by prediction errors of existing rule mining models. However, there is no suitable evaluation in this field to provide quantitative analysis. Our ongoing work aims to develop an automatic evaluation of path rationality to improve the mining quality, and thus facilitate our inferential pipeline.

Tasks
We benchmark performance on InferWiki for two tasks. (1) Link Prediction is the task of predicting the missing head/tail entity for a given query triple (?, r, t) or (h, r, ?). Models are encouraged to rank correct entities higher than others in the vocabulary. We adopt the filtered setting (Bordes et al., 2013), which excludes candidate entities whose resulting triples have been seen in the train set. Mean reciprocal rank (MRR) and hits@k are the standard metrics for evaluation. (2) Triple Classification aims to predict a label for each given triple (h, r, t). Under the open-world assumption the label is ternary, y ∈ {−1, 0, 1}, and it becomes binary, y ∈ {−1, 1}, under the closed-world assumption: all 0-label triples are relabeled as −1, since our unknown triples are factually negative yet non-inferential from training data. Since KGC models output real-valued scores for triples, we map scores to labels by choosing one or two thresholds per relation type on the validation set. Accuracy, precision, recall, and F1 are the measurements.
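The threshold-based triple classification described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: function names are hypothetical, higher scores are assumed to mean more plausible triples, the search is a plain grid over candidate values, and accuracy is assumed as the selection criterion. In practice the search would be run once per relation type on validation triples.

```python
def classify_open_world(score, low, high):
    """Map a score to 1 (positive), 0 (unknown), or -1 (negative)."""
    if score >= high:
        return 1
    if score >= low:
        return 0
    return -1


def to_closed_world(label):
    """Closed-world relabeling: unknown (0) is treated as negative (-1)."""
    return -1 if label == 0 else label


def search_thresholds(scores, labels, candidates):
    """Grid-search (low, high) maximizing accuracy on validation data."""
    best_pair, best_accuracy = None, -1.0
    for low in candidates:
        for high in candidates:
            if low > high:
                continue
            predictions = [classify_open_world(s, low, high) for s in scores]
            correct = sum(p == y for p, y in zip(predictions, labels))
            accuracy = correct / len(labels)
            # Strict comparison keeps the first pair that reaches the best value.
            if accuracy > best_accuracy:
                best_pair, best_accuracy = (low, high), accuracy
    return best_pair, best_accuracy
```

For the closed-world setting, a single threshold suffices; it can be obtained from the same search by restricting the grid to pairs with `low == high`, after relabeling the validation labels with `to_closed_world`.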

Models
For comprehensive comparison, we choose three types of representative models as baselines: (1) embedding-based models, namely TransE, ComplEx, RotatE, ConvE, and TuckER; (2) the multi-hop reasoning model Multihop; and (3) the rule-based model AnyBURL. Note that the latter two are specially designed for link prediction. The detailed implementation, including parameters and thresholds, can be found in Appendix F.
Under the closed-world assumption, the baseline models perform well, with around 90% F1 scores. This is consistent with recent findings that triple classification is a nearly solved task (around 98% F1 scores) (Safavi and Koutra, 2020). Nevertheless, the lower performance demonstrates the difficulty of our curated datasets, mainly due to the manually annotated hard negatives of InferWiki (and CoDEx). Figure 3 presents the accuracy on InferWiki16k for various types of triples: positives, randomly supplemented negatives, and annotated negatives (including relabeled unknowns). We can see that (1) random negative triples are indeed trivial for all of the baseline models, which motivates the need for harder negative triples to push this research direction forward, (2) positive triples are slightly more difficult to judge than random negatives, and (3) the accuracy drops significantly on annotated negatives. This is mainly because most annotated triples are actually unknown: they are factually incorrect, but there are no obviously abnormal patterns. Such non-inferential cases may underestimate KGC models.

Open-world Assumption
Since most baselines fail to judge unknown triples as negative, we now investigate them under the open-world assumption to see their ability to recognize unknown triples. Table 5 shows the macro performance on InferWiki16k. We can see that all of the baseline models perform worse than under the closed-world assumption. On the one hand, ternary classification is intuitively more difficult than binary classification. On the other hand, searching two decision thresholds, one between positive and unknown and the other between unknown and negative, is a rather straightforward method. This motivates future work on advanced models for representing KGs, which should also be able to detect the limitations and boundaries of a given KG. It is a fundamental capacity of inference to respond "I do not know", to avoid false negatives in downstream applications.
Figure 4 presents a detailed analysis of each model with respect to its searched thresholds. We can see that although their best performance seems acceptable, the worst scores are only around 10%. That is, the models are very sensitive to thresholds. Besides, most of the time, the average F1 scores of ComplEx, RotatE, and TuckER are around 20%, while TransE achieves higher scores. This may be one reason why it is still the most widely used KGC method. ConvE stably outperforms the other baselines, whether in terms of best, worst, or average performance.

Link Prediction Results
Table 6 shows the average scores for head and tail prediction (bold fonts denote the best scores and underlines highlight the second best). We can see that (1) AnyBURL performs the best most of the time, but the performance gap is not significant. This is mainly due to its role in dataset construction, although we utilize two sets of triples to minimize rule leakage. In fact, inference over rules may be more important than we thought for improving the reliability and interpretability of knowledge-driven models. This also motivates us to incorporate rule knowledge into KGC training for advanced reasoning ability (Li et al., 2019b).
(2) KGC models perform better on InferWiki16k than InferWiki64k, due to the higher structure density and rule confidence.
(3) Models have higher Hit@10 and lower Hit@1 on InferWiki than on other datasets (e.g., CoDEx). This agrees with the intuition that most entities are irrelevant, making it trivial to judge these corrupted triples, as in triple classification. Only a small portion of entities is difficult to predict, which requires strong inference ability. Besides, Hit@1 varies considerably, so that we can better compare models.
Impacts of Inferential Path Length. Figure 5 presents Hit@1 curves for tail prediction with respect to path length on InferWiki64k. We can see an overall downward trend as the path length increases. Meanwhile, the large fluctuation may be due to two possible reasons. (1) As discussed in Section 3.5, the inferential paths ensure predictability but may not be the shortest ones. The analysis is thus conservative and gives an upper bound of the performance concerning k-hop paths. Our paths are of high coverage and quality compared with existing datasets, which either conduct case studies or postprocess datasets via rule mining. (2) Relation types and patterns also have significant impacts. Shorter paths contain more long-tail relations, and longer paths tend to cover many common relations. This increases the difficulty of shorter paths and makes longer paths easier.
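For reference, the filtered ranking metrics reported in this section (MRR and Hit@k) can be computed as in the sketch below. Names are hypothetical, and the sketch makes assumptions not fixed by the text: candidate scores are given as a dictionary from entity to score, higher is better, ties do not penalize the target, and `known_true` holds the other entities already known to be correct answers for the query.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, skipping other known correct answers.

    scores:     dict mapping candidate entity -> model score (higher is better)
    known_true: entities known to be correct for this query (filtered out)
    """
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_true:
            continue
        if score > target_score:
            rank += 1
    return rank


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)
```

Averaging these metrics over head queries (?, r, t) and tail queries (h, r, ?) would give the kind of combined scores reported for head and tail prediction.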

Impacts of Relation Patterns
We present the Hit@1 tail prediction on InferWiki64k with respect to relation patterns in Table 7. We can see that symmetry and inversion are not well solved, and should thus be included in evaluation, though limited in scale. TransE performs worse on symmetry and inversion relations, consistent with the analysis in Abboud et al. (2020). Even if ComplEx and RotatE can capture such patterns, they fail to rank the corresponding entities at the top. Embedding-based models perform well on hierarchy relations, even outperforming AnyBURL. Compositional relations are still quite challenging and worth further investigation.

Comparison of CoDEx-infer and CoDEx
We investigate the impacts of rule-based train/test generation by comparing CoDEx-m-infer with CoDEx-m. The two datasets share the same training set. The only difference lies in how we obtain the test triples, either using our proposed pipeline (CoDEx-m-infer) or randomly (CoDEx-m). Thus, the results reflect the impacts of the inferential guarantee for dataset construction and demonstrate the necessity of avoiding over-estimation or under-estimation of the inferential ability of KGC models. We report the performance on CoDEx-m from the original paper (Safavi and Koutra, 2020).

Table 7: Hit@1 tail prediction on Relation Patterns.
Model      Sym    Inv    Hier   Comp   Others
TransE     .000   .049   .479   .211   .296
ComplEx    .130   .279   .502   .368   .414
RotatE     .191   .246   .694   .477   .610
ConvE      .558   .668   .855   .602   .784
TuckER     .527   .612   .850   .625   .753
Multihop   .231   .309   .345   .240   .296
AnyBURL    .782   .793   .782   .686   .809
We can see that all of the models perform better with the inferential path guarantee on CoDEx-m-infer than on CoDEx-m, except ComplEx for link prediction. This is because rule guidance eliminates non-inferential testing triples, making the task easier. Nevertheless, the scores on hard cases actually decrease (as discussed in Figure 3 and Table 7). Models are expected to have a stronger reasoning ability among several related entities, instead of trivially filtering out massive numbers of irrelevant entities. This also demonstrates the necessity of InferWiki to avoid over- or under-estimation of the inferential ability of KGC models, that is, learning new knowledge from existing knowledge.

Case Study of Relation Types
We illustrate the most frequent relation types and their distribution in InferWiki64k and InferWiki16k in Figure 8. We can see that InferWiki has diverse relation types that are not limited to specific domains. Besides, the triples of each relation type are well balanced.

Conclusion
We highlighted three principles for KGC datasets: inferential ability, assumptions, and patterns, and contributed a large-scale dataset, InferWiki. We established a benchmark with seven KGC models of three types on two tasks, triple classification and link prediction. The results present a detailed analysis regarding various inference patterns, which demonstrates the necessity of an inferential guarantee for better evaluation and the difficulty of the new open-world triple classification.
In the future, we are interested in cross-KG inference and transfer, and in investigating how to inject knowledge into deep learning architectures, such as for information extraction (Tong et al., 2020) or text generation (Cao et al., 2020b).
Table 8 lists existing KGC datasets. We can roughly classify them into two groups: inferential and non-inferential datasets. Datasets in the first group are usually manually curated to ensure that each testing sample can be inferred from training data through reasoning paths. Families tests family relationships, including cousin, ancestor, marriage, parent, sibling, and uncle, among the members of 5 families along 6 generations. Thus, there are obvious compositional relationships like uncle ≈ sibling + parent or parent ≈ married + parent. Kinship (Kemp et al., 2006) contains kinship relationships among members of the Alyawarra tribe from Central Australia, while Country contains countries, regions, and subregions as entities and is carefully designed to explicitly test the location relationships (i.e., locatedIn and neighbor) among them. The above datasets are clearly limited in scale and inference patterns, and are thus not challenging. HOLE (Nickel et al., 2016) even achieves 99.7% AUC-PR on the Country dataset.
Datasets in the second group are automatically derived from public KGs by randomly splitting positive triples into train/valid/test, which risks producing testing samples that are non-inferential from training data. FB13 (Socher et al., 2013) and FB15K are commonly used benchmarks from Freebase. FB15k401 is a subset of FB15k containing only frequent relations (relations with at least 100 training examples). To remove test leakage, FB15k-237 removes all equivalent or inverse relations. Similarly, FB5M removes all the entity pairs that appear in the testing set. WN18RR is the challenging version of WN18, extracted from WordNet. Textual information is also included for specific tasks, such as FB40K (Lin et al., 2015), which targets the relation extraction dataset New York Times (Riedel et al., 2010). FB24K (Lin et al., 2016) introduces attributes. FB15K+ introduces types and makes FB15k sparser by only filtering out relations with a frequency lower than one. Another popular knowledge source is YAGO, and the corresponding datasets include YAGO3-10 and YAGO37. Apart from open-domain KGs, NELL concentrates on location and sports, and UMLS targets medical knowledge. CoDEx (Safavi and Koutra, 2020) questions the quality of the above benchmarks, arguing, for example, that NELL995 contains nonsensical or overly generic triples. They thus propose a comprehensive dataset consisting of three knowledge graphs varying in size and structure, with entity types, multilingual labels and descriptions, and hard negatives.

B Annotation Guideline
We provide the following annotation guidelines for annotators to label inferred triples in Section 3.4.
Task. This is a two-step annotation. First, you must annotate each triple with the label y ∈ {1, −1}, where 1 denotes that the triple is correct and −1 denotes that the triple is incorrect. You can find the answer from anywhere you want, such as commonsense, Wikipedia, and professional websites. If you cannot find any evidence to support the statement, you shall choose label −1. Second, you must annotate each incorrect triple with the label ŷ ∈ {0, −1}, where 0 denotes that you do not know the answer. Now, you can find the answer from our provided triples. If you cannot find any evidence to support the statement, you shall choose label 0.
Examples. Here are some examples judged using three types of knowledge sources.
• Wikipedia: Given the triples (Tōkaidō Shinkansen, connectsWith, Osaka Higashi Line) and (Tōkaidō Shinkansen, connectsWith, San'yō Main Line), you can find related station information on the page of Tōkaidō Shinkansen. You can find that Osaka Higashi Line shares a transfer station with Tōkaidō Shinkansen, and thus label it with 1. San'yō Main Line does not show up on the page, so you may label it with −1.

C Relation Patterns
InferWiki is able to analyze relation patterns for each path, including symmetry, inversion, hierarchy, and composition, where detailed explanations and examples are listed in Table 9.

D Relation Types
We illustrate the most frequent relation types and their distribution in InferWiki64k and InferWiki16k in Figure 8. Figure 9 shows the distribution of entities and their neighbors as compared to widely used datasets: FB15k237 and CoDEx-m.

F Experiment Setup
Our experiments are run on a server with the following configuration: Ubuntu 16.04.6 LTS, an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, and a GeForce RTX 2080 Ti GPU. We use OpenKE (https://github.com/thunlp/OpenKE) for re-implementing TransE, ComplEx, and RotatE. For the remaining models, we use the original code for ConvE (https://github.com/TimDettmers/ConvE), TuckER (https://github.com/ibalazevic/TuckER), Multihop, and AnyBURL. Because we utilize various types of KGC models, including embedding-based, multi-hop reasoning (reinforcement learning), and rule-based models, these models largely have their own hyperparameters. To avoid an exhaustive parameter search over a large range, we conduct a series of preliminary experiments and find that the suggested parameters work well on Wikidata-based data. We then search the embedding size in {256, 512}, the number of negative samples in {15, 25}, and the margin in {4, 8}. The optimal parameters of each model on all three datasets are listed in Table 10. The thresholds in triple classification are listed in Table 11.