KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion

A comprehensive knowledge graph (KG) contains an instance-level entity graph and an ontology-level concept graph. The two-view KG provides a testbed for models to "simulate" human abilities on knowledge abstraction, concretization, and completion (KACC), which are crucial for humans to recognize the world and manage learned knowledge. Existing studies mainly focus on partial aspects of KACC. To promote thorough analyses of the KACC abilities of models, we propose a unified KG benchmark that improves existing benchmarks in terms of dataset scale, task coverage, and difficulty. Specifically, we collect new datasets that contain larger concept graphs, abundant cross-view links, and dense entity graphs. Based on the datasets, we propose novel tasks such as multi-hop knowledge abstraction (MKA) and multi-hop knowledge concretization (MKC), and then design a comprehensive benchmark. For the MKA and MKC tasks, we further annotate multi-hop hierarchical triples as harder samples. The experimental results of existing methods demonstrate the challenges of our benchmark. The resource is available at https://github.com/thunlp/KACC.

the concept view) contains concepts and conceptual relations. It provides abstract and commonsense knowledge like (painter, create, painting). In this paper, we name this kind of two-view KG the entity-concept KG (EC-KG). In an EC-KG, the relations can be grouped into three categories. The "subclassOf" relation forms hierarchical concept structures via triples like (painter, subclassOf, artist). The "instanceOf" relation groups entities into concepts, such as (Da Vinci, instanceOf, painter). These two relations are important for testing models' abilities on knowledge abstraction and concretization. Other relations are logical relations for testing models' abilities on knowledge completion. An example of the EC-KG is shown in Figure 1. Over the last decade, many works have focused on learning representations for KGs, such as TransE (Bordes et al., 2013), DistMult, ComplEx (Trouillon et al., 2016), and TuckER (Balažević et al., 2019). Though they have achieved promising results on knowledge graph completion, most of them focus on a single graph, especially the entity graph.
Beyond modeling a single graph of KGs, recent studies demonstrate that jointly modeling the two graphs in the EC-KG can improve the understanding of each one (Xie et al., 2016;Moon et al., 2017;Lv et al., 2018;Hao et al., 2019). They also propose several tasks on the EC-KG, such as link prediction and entity typing. These tasks focus on partial aspects of knowledge abstraction, concretization, and completion, which are essential abilities for humans to recognize the world and acquire knowledge. For example, in entity typing, a model may link the entity "Da Vinci" to the concept "painter" which reflects the model's abstraction ability. However, little work has been devoted to unified benchmarking and studies on KACC.
In this paper, we present a comprehensive benchmark for KACC by improving existing benchmarks in dataset scale, task coverage, and difficulty.
Dataset scale. We have examined the EC-KGs proposed by previous works such as Hao et al. (2019). Due to the data distribution, the concept graphs are small compared to the entity graphs. Furthermore, the cross-view links between the two graphs are sparse (refer to Section 4.3). Both factors may limit the knowledge transfer between the two graphs. To tackle these problems, we construct several datasets of different scales based on Wikidata (Vrandečić and Krötzsch, 2014) with careful filtering, annotation, and refinement. As Wikidata contains more fine-grained concepts, our datasets have large concept graphs, abundant cross-view links, and dense entity graphs.
Task coverage. Most previous works focus on partial tasks of KACC. In our benchmark, we define the tasks more comprehensively and categorize these tasks into three classes: knowledge abstraction, concretization, and completion.
Difficulty. We propose two new tasks, multi-hop knowledge abstraction and multi-hop knowledge concretization, which require models to predict multi-hop "instanceOf" and "subclassOf" triples that do not exist in KGs but can be inferred via relation transitivity. These tasks are meaningful and important since correctly modeling these triples is necessary for models to truly understand the concept hierarchy. To ensure the quality of these tasks, we annotate corresponding multi-hop datasets. Our experiments show that these tasks are still challenging for existing models.
Based on our benchmark, we conduct extensive experiments with existing baselines and provide thorough analyses. The experiments show that while the methods specifically designed for modeling hierarchies perform better than general KGE models on abstraction and concretization tasks, they are not competitive with some general KGE models on logical relations. Moreover, all methods suffer drastic performance degradation on multi-hop tasks, and the knowledge transfer between the entity graph and the concept graph is still obscure. Finally, we present useful insights for future model design.
2 Related Work

Knowledge Graph Datasets
Existing datasets for knowledge graph completion are usually subgraphs of large-scale KGs, such as FB15K, FB15K-237, WN18, WN18RR and CoDEx (Bordes et al., 2013;Toutanova et al., 2015;Dettmers et al., 2018;Safavi and Koutra, 2020). These datasets are all single-view KGs, in which FB15K, FB15K-237, and CoDEx focus on the entity view while WN18 and WN18RR can be regarded as concept view KGs. Several datasets try to link the two views in different ways. Firstly, some datasets provide additional type information to the entity graph, such as FB15K+, FB15K-ET and YAGO43K-ET (Xie et al., 2016;Moon et al., 2017). Secondly, some datasets provide concept hierarchies for the entity graph, such as Probase (Wu et al., 2012) and YAGO39K (Lv et al., 2018). However, they do not provide the full concept graphs with logical relations. Thirdly, some datasets provide the full concept graphs (Hao et al., 2019), but both the scale and the depth of the concept hierarchy are limited. For example, the entity numbers of DB111K-174 (Hao et al., 2019) and our dataset KACC-M are similar, but KACC-M has 38 times more concepts than DB111K-174 (see Table 1).

Knowledge Embedding Methods
Existing knowledge embedding (KE) methods can be categorized as translation models (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Ji et al., 2015), tensor factorization based models (Nickel et al., 2016; Trouillon et al., 2016; Balažević et al., 2019), and neural models (Socher et al., 2013; Dettmers et al., 2018; Nguyen et al., 2018). These methods are typically designed for single-view KGs. Although they can be directly applied to EC-KGs by ignoring the different characteristics of entity graphs and concept graphs, they cannot take full advantage of the information in EC-KGs.
Several works (Krompaß et al., 2015;Xie et al., 2016;Ma et al., 2017;Moon et al., 2017) incorporate the type information into KE methods to help the completion of entity graphs. ETE (Moon et al., 2017) further conducts entity typing, which can be seen as a simplified version of our knowledge abstraction task. Though types of entities can be seen as concepts, they omit the concept hierarchy and interactions (conceptual relations) between concepts.
To jointly model the whole EC-KG, TransC (Lv et al., 2018) adopts TransE as the entity graph model and models concepts as spheres that enclose the points of entities. However, it is not flexible enough to model logical relations between concepts. AttH (Chami et al., 2020) further combines hyperbolic embedding methods with KGE methods to simultaneously embed hierarchical and logical relations. JOIE (Hao et al., 2019) uses different training paradigms for training the entity graph, the concept graph, and the cross-view links. It also defines several meaningful tasks on the EC-KG. In this paper, we extend these tasks with several newly proposed tasks, then categorize and formalize them into a unified benchmark. We also test several of the KE methods mentioned above on our benchmark and analyze their advantages and deficiencies in handling these tasks.

Benchmark
In this section, we propose the KACC benchmark with three tasks: knowledge abstraction, knowledge concretization, and knowledge completion.

Formalizations
We first give formalizations of the EC-KG, then we introduce multi-hop triples used in later tasks.
Formalizations of EC-KG. An EC-KG is a comprehensive KG that contains two subgraphs and the cross-view links between them. The entity graph $G_E = \{E_E, R_E, T_E\}$ is composed of the entity set $E_E$, the relation set $R_E$, and the corresponding triple set $T_E = \{(h, r, t)\}$, where $h$, $r$, $t$ represent the head, relation, and tail of a triple, respectively. The concept graph $G_C = \{E_C, R_C, T_C\}$ contains the concept set $E_C$, the conceptual relation set $R_C$, and the triple set $T_C$. In our settings, $E_E$ and $E_C$ are disjoint sets, while $R_E$ and $R_C$ may contain some relations in common (see Section 4.3). The cross-view links $T_S = \{(h_S, r_{ins}, t_S)\}$ connect the two subgraphs, where $h_S \in E_E$, $t_S \in E_C$, and $r_{ins}$ is the "instanceOf" relation. Therefore, the EC-KG can be formalized as $G = \{G_E, G_C, T_S\}$.
There are two special relations, "instanceOf" and "subclassOf", that are crucial for knowledge abstraction and concretization. We use "ins" and "sub" to denote them in the rest of our paper, respectively. The corresponding triples are $T_{ins} = T_S$ and $T_{sub} \subset T_C$. Other relations are logical relations. Their corresponding triples in the concept graph are $T_{C(logic)} = T_C \setminus T_{sub}$, and the logical triples in the entity graph are $T_{E(logic)} = T_E$.
Multi-hop Triples. Hierarchical relations like "ins" and "sub" should preserve multi-hop transitivity, which can be explained by two rules:
$$(e, \mathrm{ins}, c_1) \wedge (c_1, \mathrm{sub}, c_2) \wedge \dots \wedge (c_{N-1}, \mathrm{sub}, c_N) \Rightarrow (e, \mathrm{ins}, c_N) \quad (1)$$
$$(c_0, \mathrm{sub}, c_1) \wedge (c_1, \mathrm{sub}, c_2) \wedge \dots \wedge (c_{N-1}, \mathrm{sub}, c_N) \Rightarrow (c_0, \mathrm{sub}, c_N) \quad (2)$$
in which $\{c_i \mid i \geq 1\}$ are defined as the high-level concepts for $e$ and $c_0$. These two rules indicate that an entity/concept always belongs to its high-level concepts. With these rules, we can collect multi-hop hierarchical triples like $(e, \mathrm{ins}, c_N)$ and $(c_0, \mathrm{sub}, c_N)$ from the train data and use them as harder samples for knowledge abstraction and concretization testing. The corresponding datasets of multi-hop hierarchical triples are denoted as $T_{M\text{-}Ins}$ and $T_{M\text{-}Sub}$.
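The transitivity described above (one "ins" edge followed by "sub" edges, or a chain of "sub" edges) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' extraction script: the function names, the `max_hops` cut-off, and the concept "person" in the usage example are assumptions introduced here.

```python
def ancestors(parents, start, max_edges):
    """Yield (concept, n_edges) for concepts reachable from `start` by
    following between 1 and `max_edges` "sub" edges (shortest path first)."""
    seen, frontier = {start}, [start]
    for n_edges in range(1, max_edges + 1):
        next_frontier = []
        for concept in frontier:
            for parent in sorted(parents.get(concept, ())):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
                    yield parent, n_edges
        frontier = next_frontier


def multi_hop_triples(ins_pairs, sub_pairs, max_hops=3):
    """Derive multi-hop "ins" and "sub" triples that are implied by
    transitivity but are not present as one-hop triples."""
    parents = {}
    for child, parent in sub_pairs:
        parents.setdefault(child, set()).add(parent)
    known_ins, known_sub = set(ins_pairs), set(sub_pairs)

    m_ins, m_sub = set(), set()
    # One "ins" edge followed by k "sub" edges gives a (k + 1)-hop "ins" triple.
    for entity, concept in ins_pairs:
        for upper, _ in ancestors(parents, concept, max_hops - 1):
            if (entity, upper) not in known_ins:
                m_ins.add((entity, "ins", upper))
    # A chain of k >= 2 "sub" edges gives a k-hop "sub" triple.
    for concept in parents:
        for upper, n_edges in ancestors(parents, concept, max_hops):
            if n_edges >= 2 and (concept, upper) not in known_sub:
                m_sub.add((concept, "sub", upper))
    return m_ins, m_sub


# Hypothetical toy hierarchy; "person" is a made-up upper concept.
m_ins, m_sub = multi_hop_triples(
    ins_pairs=[("Da Vinci", "painter")],
    sub_pairs=[("painter", "artist"), ("artist", "person")],
)
print(sorted(m_ins))
print(sorted(m_sub))
```

Triples produced this way are only candidates: they follow mechanically from transitivity and say nothing about whether the resulting statement is semantically valid.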

Knowledge Abstraction
This task contains tail prediction tasks for one-hop and multi-hop "ins" and "sub" triples. We use KA (knowledge abstraction) and MKA (multi-hop knowledge abstraction) to denote the tasks.
KA-Ins / KA-Sub: KA-Ins and KA-Sub are tail prediction tasks for "ins" triples and "sub" triples, respectively. All of these triples come from the original datasets, and these tasks reflect the direct knowledge abstraction ability of models.
MKA-Ins / MKA-Sub: MKA-Ins and MKA-Sub are tail prediction tasks for the multi-hop hierarchical triples $T_{M\text{-}Ins}$ and $T_{M\text{-}Sub}$. These tasks reflect models' abilities on high-level concept abstraction: they aim to predict upper concepts multiple hops away in the concept hierarchy.

Knowledge Concretization
Similar to the KA and MKA tasks, this task contains KC (knowledge concretization) and MKC (multi-hop knowledge concretization) tasks.
KC-Ins / KC-Sub: KC-Ins and KC-Sub are head prediction tasks for "ins" and "sub" triples, which aim to predict entities for concepts or low-level concepts for high-level ones.
MKC-Ins / MKC-Sub: These tasks are head prediction tasks for multi-hop hierarchical triples. They aim to predict entities for concepts, or finer concepts for coarser concepts, that are multiple hops away.

Knowledge Completion
The knowledge completion task contains the subtasks of entity graph completion (EGC) and concept graph completion (CGC) under two settings. In the "Single" setting, models can only use a single graph for knowledge graph completion, while in the "Joint" setting both graphs and the cross-view links are provided.
CGC-Single / EGC-Single: These subtasks are conducted on each single graph, $G_C$ and $G_E$. The test phase is conducted on the logical triples of each graph, $T_{C(logic)}$ and $T_E$. The results can be compared with those from CGC/EGC-Joint to test the effectiveness of jointly modeling the two graphs.
CGC-Joint: This subtask requires the model to do link prediction with the full information of the EC-KG $G$. The model needs to abstract conceptual knowledge from the entity graph to do link prediction for the logical concept graph triples $T_{C(logic)}$. The results of this subtask can also be used to verify models' abilities on knowledge abstraction.
EGC-Joint: Models are required to use the guidance from the concept graph to do link prediction for the entity graph triples $T_E$. For example, a person in the entity graph is more likely to lead some organization if that person is a politician.

Dataset Construction
In this section, we first provide the details of our data collection process and annotation process. Then we give a detailed analysis of the statistical characteristics of the datasets.

Dataset Collection
The dataset construction process has four steps:
Step 1: Entity Filtering. We select the entities in FB15K-237 (Toutanova et al., 2015) as our seed entities. We first locate the corresponding seed entities in the Wikidata dump via the "FreebaseID" property of each item. Note that some entities in Freebase may be labeled as concepts in Wikidata, so we filter these concepts out of our seed entity set. Then we extract the one-hop neighbors of the seed entities in the entity graph to form the entity pool, which contains more than 10 million entities. With the entity pool, we can sample an arbitrary number of one-hop neighbors to form the entity graph of our dataset. Our sampling strategy is to select the entities with the highest degrees, and the final entity set consists of all seed entities and the sampled one-hop neighbors. To meet the requirements for different scales, we propose three sizes of datasets: (1) KACC-S, which contains only the seed entities; (2) KACC-M, whose expected total entity number is set to 100K; (3) KACC-L, whose entity number is set to 1M.
Step 2: Concept Finding. Next, we extract concepts based on the selected entities. We use a breadth-first search algorithm to find the concepts. The algorithm starts from entities and searches for concepts via "ins" triples and "sub" triples. Since the concept hierarchy follows the structure of a directed acyclic graph, our algorithm ends when all potential concepts are found.
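A breadth-first search of this kind can be sketched as follows. The function name and the pair-based input format are assumptions for illustration, and the toy triples are hypothetical; the visited set also guarantees termination if the hierarchy happens to contain loops.

```python
from collections import deque


def find_concepts(entities, ins_pairs, sub_pairs):
    """Collect every concept reachable from the given entities by
    following "ins" and "sub" edges upward (breadth-first)."""
    upward = {}
    for head, tail in list(ins_pairs) + list(sub_pairs):
        upward.setdefault(head, set()).add(tail)

    concepts = set()
    queue = deque(entities)
    while queue:
        node = queue.popleft()
        for concept in upward.get(node, ()):
            if concept not in concepts:  # visit each concept once
                concepts.add(concept)
                queue.append(concept)
    return concepts


# Hypothetical toy data: "city" is never reached from the selected entity.
found = find_concepts(
    entities=["Da Vinci"],
    ins_pairs=[("Da Vinci", "painter")],
    sub_pairs=[("painter", "artist"), ("artist", "person"), ("city", "place")],
)
print(sorted(found))
```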
Step 3: Triple Extracting and Filtering. This step first extracts the cross-view links and all triples in the entity/concept graph. Then we filter all triples by relation statistics and annotation. We drop relations (1) that have fewer than 10 triples, (2) whose head or tail entity set contains fewer than 10 elements, or (3) that are annotated as meaningless. Similar to Toutanova et al. (2015), we further remove reverse relations to prevent valid/test leakage.
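The statistical part of this filter is straightforward to express in code. The sketch below is a minimal illustration under the thresholds stated above; the function name, parameter names, and the `meaningless` argument (standing in for the manual annotation) are assumptions, and reverse-relation removal is not covered.

```python
from collections import defaultdict


def filter_relations(triples, min_triples=10, min_entities=10, meaningless=frozenset()):
    """Keep only triples whose relation has enough triples, enough distinct
    heads and tails, and is not annotated as meaningless."""
    count = defaultdict(int)
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        count[r] += 1
        heads[r].add(h)
        tails[r].add(t)

    kept_relations = {
        r for r in count
        if count[r] >= min_triples
        and len(heads[r]) >= min_entities
        and len(tails[r]) >= min_entities
        and r not in meaningless
    }
    return [(h, r, t) for h, r, t in triples if r in kept_relations]


# Hypothetical toy input: only relation "a" passes all three checks.
toy = (
    [(f"h{i}", "a", f"t{i}") for i in range(10)]    # 10 triples, 10 heads, 10 tails
    + [(f"h{i}", "b", "t0") for i in range(10)]     # only one distinct tail
    + [(f"h{i}", "c", f"t{i}") for i in range(3)]   # too few triples
)
print({r for _, r, _ in filter_relations(toy)})
```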
To get more precise concept graphs, we ask human annotators to find out meaningless concepts and we further remove these concepts. These "meaningless" concepts include concepts with no labels or descriptions, concepts used for the self-construction of Wikidata (e.g. "Wikimedia list article"), etc. Details of the annotation step can be found in Appendix A.1.

Multi-hop Triple Annotation
To support the MKA/MKC tasks, we extract multi-hop "ins" and "sub" triples from the corresponding train sets according to rule (1) and rule (2). Ideally, hierarchical triples should preserve multi-hop transitivity. However, when we look into real-world datasets like YAGO26K-906 and ours, we find that multi-hop transitivity does not always hold. As illustrated in Figure 2, the transitivity is violated when the transitivity link goes deep: (scientist, sub, occupation) is meaningful while (scientist, sub, human activity) is not. To make our multi-hop triples meaningful, we further ask human annotators to check the validity of these triples. Details are in Appendix A.2.

Dataset Analysis
In this subsection, we compare our datasets with existing datasets YAGO26K-906 and DB111K-174 (Hao et al., 2019) in terms of scale, domain coverage, and hierarchical relations. The statistics of these datasets are shown in Table 1.
Scale. From Table 1, we can see that concept graphs in our three datasets have more balanced sizes compared to entity graphs. From the comparison between DB111K-174 and KACC, we can see that entity graphs of these two datasets have similar sizes, but KACC has more concepts, conceptual relations, and triples in the concept graph.
Our datasets also have rich cross-view links. In Table 1, the average numbers of cross-view links per entity are less than 1.0 in YAGO26K-906 (0.38) and DB111K-174 (0.89), which means that many entities in these datasets are not connected to concepts. In KACC, the ratios are all above 1.09, indicating that one entity may belong to multiple concepts.
Domain Coverage. In Figure 3, we plot 15 most
Hierarchical Relations. We present the characteristics of hierarchical relations in our datasets.
We first examine the data quality of the sub triples in each dataset by detecting duplicate edges and self-loops. As the global structure of sub relations is assumed to be a directed acyclic graph, we use the topological sort algorithm to find loops. We report the numbers of concepts that are not reached by the algorithm in each dataset (these concepts are in loops or dangling from loops). The statistics are in Table 2. We can see that our datasets are of high quality, with fewer duplicate edges, no self-loops, and a smaller proportion of concepts in loops. Finally, we remove duplicate edges and wrongly labeled triples in loops after a manual check.
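One way to run such a check is Kahn's topological sort: any concept that is never emitted must lie on a loop or be reachable only through one. The sketch below is an illustration with assumed names and toy edges, not the script used to produce Table 2.

```python
from collections import Counter, deque


def check_sub_edges(sub_pairs):
    """Return (number of duplicate edges, self-loop edges, concepts that a
    topological sort cannot reach) for a list of (child, parent) pairs."""
    counts = Counter(sub_pairs)
    duplicates = sum(c - 1 for c in counts.values())
    self_loops = [edge for edge in counts if edge[0] == edge[1]]
    edges = [edge for edge in counts if edge[0] != edge[1]]

    nodes = {n for edge in edges for n in edge}
    parents = {n: [] for n in nodes}
    indegree = dict.fromkeys(nodes, 0)
    for child, parent in edges:
        parents[child].append(parent)
        indegree[parent] += 1

    # Kahn's algorithm: repeatedly remove concepts with no incoming edge.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = set()
    while queue:
        node = queue.popleft()
        visited.add(node)
        for parent in parents[node]:
            indegree[parent] -= 1
            if indegree[parent] == 0:
                queue.append(parent)
    return duplicates, self_loops, nodes - visited


# Hypothetical edges: a duplicate, a self-loop, and a loop x <-> y with z hanging off it.
toy_edges = [("a", "b"), ("a", "b"), ("b", "c"),
             ("x", "y"), ("y", "x"), ("y", "z"), ("s", "s")]
print(check_sub_edges(toy_edges))
```

In the toy input, "z" is not itself on a loop, but it is still reported because the sort can never clear the loop beneath it.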
Then we examine the depths of the concepts in each dataset. We start from the bottom concepts and traverse all concepts via topological sort. We plot the numbers of concepts at different depths in Figure 4. From the figure, we can see that our datasets have deeper hierarchical structures than the others, which are more informative and useful for models to learn more fine-grained representations.
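A depth computation along these lines is sketched below. It assumes that the depth of a concept is the length of the longest "sub" chain from any bottom concept, which is one plausible reading of the description rather than a confirmed definition; names and toy edges are illustrative.

```python
from collections import Counter, deque


def concept_depths(sub_pairs):
    """Depth of each concept, taken here (as an assumption) to be the longest
    chain of "sub" edges from a bottom concept. Concepts on loops are skipped."""
    edges = {(c, p) for c, p in sub_pairs if c != p}
    nodes = {n for edge in edges for n in edge}
    parents = {n: [] for n in nodes}
    pending_children = dict.fromkeys(nodes, 0)
    for child, parent in edges:
        parents[child].append(parent)
        pending_children[parent] += 1

    depth = {n: 0 for n in nodes if pending_children[n] == 0}  # bottom concepts
    queue = deque(depth)
    finished = set()
    while queue:
        node = queue.popleft()
        finished.add(node)
        for parent in parents[node]:
            depth[parent] = max(depth.get(parent, 0), depth[node] + 1)
            pending_children[parent] -= 1
            if pending_children[parent] == 0:
                queue.append(parent)
    return {n: depth[n] for n in finished}


# Hypothetical toy hierarchy with a shortcut edge painter -> person.
depths = concept_depths([("painter", "artist"), ("artist", "person"), ("painter", "person")])
print(depths)
print(Counter(depths.values()))  # number of concepts at each depth
```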
Finally, we show the characteristics of the "ins" relation in our datasets. Unlike existing datasets where "ins" only connects entities and concepts, concepts in Wikidata also have "ins" connections, which are denoted by $T_{C(ins)}$. We find these triples are also meaningful as they reflect semantics at different levels. For example, in the triple (planet, ins, astronomical object type), "planet" is a concept, while it can also be regarded as an instance when mentioned at a higher level. This finding is also compatible with human cognition. We therefore retain these triples in our datasets and test them together with other "ins" triples. Accordingly, we modify the corresponding definitions in Section 3.1 to $T_{ins} = T_S \cup T_{C(ins)}$ and $T_{C(logic)} = T_C \setminus (T_{sub} \cup T_{C(ins)})$.
Other Characteristics. Our datasets also have some new properties. In existing datasets, relations in the entity graph and conceptual relations in the concept graph are disjoint. However, in our datasets, some relations appear in both the entity graph and the concept graph. For example, the "partOf" relation appears in (Chile, partOf, South America) in the entity graph and (hospital, partOf, health system) in the concept graph. Our experiments treat them as different relations while models can also treat these relations as the same, which depends on the hypotheses of the designers.

Dataset Partition
Considering the tradeoff between scale and training efficiency, we use the medium-sized dataset KACC-M to conduct the experiments. To generate each task's train/valid/test data, we first split each triple set $T_S$, $T_E$ and $T_C$ in the proportion 8:1:1. To simplify model training and hyperparameter selection, we provide a unified train set $T^{Train}_{EC}$ for all tasks defined on the EC-KG, that is, $T^{Train}_{EC} = T^{Train}_S \cup T^{Train}_E \cup T^{Train}_C$. For the valid and test sets, each task has its own valid/test sets for model selection and performance reporting.
The train sets are different for EGC-Single and CGC-Single. As these tasks focus on a single graph, we use $T^{Train}_E$ and $T^{Train}_C$ as their train sets, respectively. The settings of the datasets for different tasks are in Table 3. The statistics are in Appendix A.3.
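The partition can be sketched as follows. This is a minimal illustration that assumes a random shuffle with a fixed seed and assumes the unified train set is simply the union of the three train splits; the function name, the seed, and the toy triples are all hypothetical.

```python
import random


def split_811(triples, seed=0):
    """Split a triple set into train/valid/test lists at roughly 8:1:1."""
    items = sorted(set(triples))
    random.Random(seed).shuffle(items)
    n_valid = len(items) // 10
    n_train = len(items) - 2 * n_valid
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


# Hypothetical toy triple sets standing in for T_S, T_E and T_C.
T_S = [(f"e{i}", "ins", f"c{i % 3}") for i in range(10)]
T_E = [(f"e{i}", "r", f"e{i + 1}") for i in range(20)]
T_C = [(f"c{i}", "sub", f"c{i + 1}") for i in range(30)]

splits = {name: split_811(t) for name, t in (("S", T_S), ("E", T_E), ("C", T_C))}

# Unified train set for tasks on the EC-KG (assumed union of the three train splits).
train_ec = splits["S"][0] + splits["E"][0] + splits["C"][0]
# The single-graph settings train on one graph only.
train_e_single, train_c_single = splits["E"][0], splits["C"][0]
print(len(train_ec), len(train_e_single), len(train_c_single))
```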

Baselines
To test how existing methods behave in our benchmark, we choose several representative models for single-view KG embedding, as well as JOIE (Hao et al., 2019) and AttH (Chami et al., 2020), which are specially designed for modeling the EC-KG.
AttH. AttH (Chami et al., 2020) utilizes hyperbolic geometry to embed tree-like structures, which is suitable for modeling the concept hierarchy. It also proposes methods to embed logical relations in the hyperbolic space.

Evaluation Metrics
We test the tasks in the form of link prediction. We use two evaluation metrics in these tasks:
Mean Reciprocal Rank (MRR). This metric computes the mean reciprocal rank of the correct instances. If the ranks of the correct instances are $k_i$, the metric is the average of $1/k_i$.
Hits@N. This metric computes the proportion of the ranks that are no larger than N.
Higher scores on both metrics indicate a better model. We use the "Filtered" setting for all evaluations, which filters out other true answers from the prediction results to get the final rank for each test case.
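The two metrics and the filtered ranking can be written compactly. The sketch below assumes that higher model scores are better and breaks ties optimistically; the function names and the toy scores are illustrative assumptions.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among all candidates after removing the other
    known true answers. `scores` maps candidate -> model score."""
    target_score = scores[target]
    better = sum(
        1 for cand, s in scores.items()
        if cand != target and cand not in known_true and s > target_score
    )
    return better + 1


def mrr(ranks):
    """Mean reciprocal rank: the average of 1 / k_i."""
    return sum(1.0 / k for k in ranks) / len(ranks)


def hits_at(ranks, n):
    """Proportion of ranks that are no larger than n."""
    return sum(k <= n for k in ranks) / len(ranks)


# Hypothetical tail prediction for (Da Vinci, ins, ?), where both
# "painter" and "artist" are known true answers.
scores = {"painter": 0.9, "artist": 0.8, "city": 0.7, "person": 0.95}
rank = filtered_rank(scores, target="artist", known_true={"painter", "artist"})
print(rank)  # "painter" is filtered out, so only "person" ranks above "artist"

ranks = [1, 2, 4]
print(mrr(ranks), hits_at(ranks, 1), hits_at(ranks, 3))
```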

Hyperparameter Settings
The performance of KGE methods is known to be sensitive to hyperparameters. We therefore run 30 quasi-random trials for all models from predefined hyperparameter spaces. We list the hyperparameter spaces we use in Appendix A.5. We run all trials for 100 epochs.
For all single-view KE methods, we use the implementations from LibKGE, which utilizes the Ax framework to perform quasi-random hyperparameter search.
For AttH and JOIE, we use the implementations released by their respective authors (see footnotes 1 and 2). For JOIE, we use TransE as the backend and adopt the suggested hyperparameter space from the paper.

Experimental Results
In this section, we provide the experimental results and further propose several future directions.

Knowledge Abstraction
The results of knowledge abstraction are shown in Table 4. From the results, we can see that AttH outperforms the other methods by a large margin on KA-Ins and also performs well on KA-Sub, which demonstrates the effectiveness of hyperbolic embeddings. JOIE outperforms its backend model TransE.
1 https://github.com/HazyResearch/KGEmb
2 https://github.com/JunhengH/joie-kdd19
Comparing results between KA-Ins and MKA-Ins, all the models have performance degradation larger than 0.51 on MRR. We conclude that the composition rule in Equation (1) is hard to learn naturally. Among all the models, AttH performs the best on both tasks and has the least degradation from KA to MKA, showing that hyperbolic space has advantages over Euclidean space in knowledge abstraction. However, the degradation is still drastic, showing the difficulty of the MKA task.
Comparing results between KA-Sub and MKA-Sub, most methods also have performance degradation while TransE-based models (TransE and JOIE) have better performances on MKA-Sub, which is interesting for further investigation. AttH performs best on MKA-Sub, which further confirms the advantage of hyperbolic methods.

Knowledge Concretization
The results of knowledge concretization are in Table 5. ComplEx and JOIE perform well on the KC-Ins and KC-Sub tasks. Similar to the knowledge abstraction tasks, MKC-Ins and MKC-Sub are also harder for existing models. The results of the knowledge concretization tasks are lower than those of the corresponding knowledge abstraction tasks, which shows that knowledge concretization is much harder than knowledge abstraction.

Knowledge Completion
The results of knowledge completion are shown in Table 6. [...] on entity-level logical relations, while ComplEx is good at dealing with concept-level logical relations. JOIE does not perform well on logical relations. Comparing the "Joint" and "Single" settings, we find that results on EGC-Joint are usually higher than results on EGC-Single, which shows that incorporating the concept graph and cross-view links helps the understanding of the entity graph. However, the pattern is not obvious for CGC-Joint and CGC-Single. This may be because entity triples far outnumber concept triples, so models tend to focus more on entity triples.

Overall Results
Finally, we compute an overall KACC score for each method to show their overall performances. Similar to GLUE (Wang et al., 2019a), we average Hits@10 scores of each method on all tasks (except CGC-Single and EGC-Single) to get final scores. We also compute the average scores for knowledge abstraction (KA), knowledge concretization (KCon), and knowledge completion (KCom). In Table 7 we can see that AttH has the best overall score and achieves the highest scores on two tasks. ComplEx also performs well. It is a balanced model since it gets the second place on all tasks. TuckER performs best on knowledge completion. In the future, we plan to test more methods and investigate their abilities.
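The aggregation can be sketched as below. The grouping of tasks into KA, KCon, and KCom follows the task definitions in the benchmark section; the function name is an assumption, and the Hits@10 values in the usage example are placeholders chosen for easy arithmetic, not results from the paper.

```python
GROUPS = {
    "KA": ["KA-Ins", "KA-Sub", "MKA-Ins", "MKA-Sub"],
    "KCon": ["KC-Ins", "KC-Sub", "MKC-Ins", "MKC-Sub"],
    "KCom": ["CGC-Joint", "EGC-Joint"],  # the "Single" subtasks are excluded
}


def kacc_scores(hits10):
    """Average Hits@10 per task group and over all included tasks."""
    result = {
        group: sum(hits10[t] for t in tasks) / len(tasks)
        for group, tasks in GROUPS.items()
    }
    all_tasks = [t for tasks in GROUPS.values() for t in tasks]
    result["Overall"] = sum(hits10[t] for t in all_tasks) / len(all_tasks)
    return result


# Placeholder Hits@10 values (NOT results from the paper).
placeholder = {
    "KA-Ins": 0.4, "KA-Sub": 0.6, "MKA-Ins": 0.2, "MKA-Sub": 0.4,
    "KC-Ins": 0.1, "KC-Sub": 0.3, "MKC-Ins": 0.1, "MKC-Sub": 0.3,
    "CGC-Joint": 0.5, "EGC-Joint": 0.7,
    "CGC-Single": 0.9, "EGC-Single": 0.9,  # present but ignored
}
print(kacc_scores(placeholder))
```

Because the overall score averages over tasks rather than over groups, the two four-task groups weigh more heavily than knowledge completion in this sketch.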

Analyses and Future Directions
From the results above, we analyze several problems that existing models cannot handle well and propose several promising future directions.
Multi-hop triple modeling. The prediction scores of multi-hop triples are lower than those of one-hop triples, showing the challenge of multihop triple modeling. Besides, how to balance the model to learn from logical and hierarchical relations is also an exciting direction.
Conceptual knowledge completion. Not all models extract conceptual knowledge effectively, as their scores on CGC-Joint are lower than those on CGC-Single. The main reason is that KE methods tend to focus more on entity triples due to the losses. They lack the ability to abstract factual knowledge to enrich conceptual knowledge.
Knowledge concretization. The results of concretization tasks are much lower than those of abstraction tasks. It demonstrates that existing models can find proper concepts for entities but cannot find correct entities for concepts. Some solutions may be using contrastive learning to "push" negative entities away from the concepts.
Besides the analyses, there are also several promising future directions of our benchmark.

Contextualized knowledge embedding. Recently, contextualized knowledge embeddings (Wang et al., 2019b) have been proposed to capture the different semantics of entities and relations in different contexts. These methods operate only on the entity graph, while incorporating concepts provides more contextual information for entities. For example, an entity that is a painter is more likely to paint than one that is a politician. Modeling concepts and entities jointly with contextualized embeddings is a promising direction.

Conclusion
In this paper, we focus on the problems of knowledge abstraction, concretization, and completion. We propose a benchmark to test the abilities of models on KACC. To conduct the evaluation, we construct large-scale datasets with desired properties, and experiments show that tasks in KACC are challenging. For future work, we plan to test more models and design advanced models to address tasks in KACC.

Ethical Considerations
Here we list the ethical considerations of our paper:
Intellectual property. All of our datasets are collected from Wikidata, which offers its data for free with no attribution requirement under the Creative Commons CC0 License.
Privacy. Our datasets are collected automatically from an online resource, and the collection process does not involve participants' privacy rights.
Compensation. For the two annotation processes, the salary for annotating each sample is computed according to the average annotation time and the local wage standard. We ensure that all annotators are well paid.
Potential problems. Though we have manually checked the quality of our datasets and removed meaningless and wrong data, false triples may still exist. These may lead to wrong predictions in knowledge abstraction, concretization, and completion tasks. However, noise is common in human-contributed resources such as existing datasets and ours, so the potential risks are low.

A.1 Annotations for Meaningless Concepts
In this section, we first present our annotation guidelines for annotators, and then we provide the annotation results.
Task Guidelines. This task aims to find out meaningless "concepts". For a given instance, you need to check whether it is a "concept". Here are some definitions in this task:
• Concept. A word for a group or a class of things, such as "artist", "writer", etc. Humans obtain concepts by abstracting commonalities from things.
We provide you with the Wikidata ID, name, and description of an instance. For more details, you can go to the web page "https://www.wikidata.org/wiki/Qxxxx" by replacing "Qxxxx" with the specific Wikidata ID. An example is shown in the following:

ID | Name | Description
Q68 | computer | general-purpose device for performing arithmetic or logical operations

If an instance is a concept, you should give the correct label. You should give the wrong label in these circumstances:
1. The instance is more like an entity than a concept, such as "Voice over Internet Protocol" (a network protocol).
2. The description and name of the instance are "None".
3. The instance is used for the website's construction and is meaningless, such as "Wikimedia list article" and "Wikimedia disambiguation page".
4. Other cases that are difficult to judge.
Annotation results. We ask human annotators to annotate all concepts in KACC-L. The result of an instance is obtained if two annotators reach the agreement. If not, a third annotator is asked to label the instance. As a result, 482 concepts are removed among 15,642 concepts.

A.2 Annotations for Multi-hop Hierarchical Triples
In this task, we extract multi-hop "instanceOf" and "subclassOf" transitivity links from different train sets and ask annotators to label the position where the hierarchical transitivity holds.
Task Guidelines. This task aims to annotate the transitivity link of concepts. For each example we provide, you need to determine whether semantic drift exists in the link and label the final position at which the transitivity holds. Here is some preliminary knowledge:
• Concept. A word for a group or a class of things, such as "artist", "writer", etc. Humans obtain concepts by abstracting commonalities from things.
• Transitivity of concepts. An example of a transitivity link of concepts is "scientist → researcher → occupation → human activity". The transitivity link starts from an entity or a concept and is followed by concepts. The transitivity of concepts assumes that the semantics of a later concept contain those of a former concept. For example, the semantics of "occupation" contain "scientist", and "occupation" is also the more general concept.
• Semantic drift. Because of the annotation process of the original data source (Wikidata), we can assume almost all one-hop links are correct, such as "scientist → researcher" in our example. But semantic drift occurs as the transitivity link goes deep. For example, "scientist" can be a subclass of "occupation" while it cannot belong to "human activity". However, the one-hop link "occupation → human activity" still holds true.
We provide you with transitivity links of concepts of length 4. These links start from an entity or a concept and are followed by concepts. We provide the Wikidata ID and the name of each entity and concept. For more details, you can go to the web page "https://www.wikidata.org/wiki/Qxxxx" by replacing "Qxxxx" with the specific Wikidata ID. Some examples of this task are shown in Table 8:

Links
(1) Q4442912 capital of Russia → Q5119 capital → Q515 city → Q702492 urban area
(2) Q3108101 tropical garden → Q1107656 garden → Q386724 work → Q15401930 product

You need to label the last position at which the concept transitivity holds true, starting from the first entity/concept. For example (1) in Table 8, "capital of Russia" can be regarded as a sub-concept of "urban area", so the position is 4. In example (2), "tropical garden" belongs to "garden" while it does not belong to "work", so the position can be labeled as 2.
We can assume that most one-hop links are correct, and you do not need to check their authenticity. For example, in "Dewey County → county of Oklahoma", you do not need to check whether Dewey is a county of Oklahoma. However, in some specific circumstances, the one-hop link may be wrong; then you can label the case as 1, which means that only the first entity/concept is true.
If you cannot find out meaningful names for entities or concepts, or you meet other cases that are difficult to judge, you can label them as 0.
Annotation Results. We extract 1200, 3000 and 6000 multi-hop "instanceOf" and "subclassOf" triples for KACC-S, KACC-M and KACC-L. These numbers are similar to the numbers of "subclassOf" triples in the corresponding valid and test sets. We ask two annotators to annotate them, and a third annotator is added if the two annotators do not reach an agreement. Note that our task requires labeling a position, so there are cases where all three annotators give different labels. In these cases, we simply omit the examples. If a case is labeled as 4, we can construct both a 2-hop and a 3-hop triple from the link. If the case is labeled as 3, we can only obtain the 2-hop triple. The statistics of our datasets are in Table 9.
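The agreement rule and the conversion from a labeled link to multi-hop triples can be sketched as follows. Function names and the tuple layout (with the hop count as the last element) are assumptions made for illustration; the example link is example (1) from Table 8.

```python
def resolve_label(first, second, third=None):
    """Return the agreed position label, or None when the example is omitted.
    A third label is only consulted when the first two disagree."""
    if first == second:
        return first
    if third is not None and third in (first, second):
        return third
    return None


def triples_from_link(link, position, relation):
    """Build multi-hop triples (head, relation, tail, hops) from a length-4 link.
    Position 4 yields a 2-hop and a 3-hop triple, position 3 only the 2-hop
    triple, and positions 0 to 2 yield nothing."""
    head = link[0]
    return [(head, relation, link[i], i) for i in range(2, position)]


link = ["capital of Russia", "capital", "city", "urban area"]
label = resolve_label(4, 3, third=4)
print(triples_from_link(link, label, "sub"))
```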

A.3 Statistics of Dataset Split
The statistics of the datasets after partition are shown in Table 10.

A.4 Additional Domain Plot for KACC
We plot the domains of our KACC-S and KACC-M in Figure 5. Domains of KACC-S and KACC-M are similar while KACC-M has more fine-grained concepts, such as "town in China" and "commune of France".

A.5 Hyperparameter Settings
In this section, we present our hyperparameter selection methods in detail. We run 30 quasi-random hyperparameter search trials on predefined hyperparameter spaces for the different baselines (see Table 11 to Table 13). Because we use different implementations, the hyperparameter spaces differ across methods.
We run each trial for 100 epochs and save a checkpoint every 20 epochs (150 saved checkpoints for one model in total). Since our benchmark contains multiple tasks, for each task we use the corresponding valid set to choose the best checkpoint based on the MRR metric, and then we test the selected checkpoint on the test set and compute the final metrics.
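The per-task checkpoint selection can be sketched as below. The data layout (a mapping from checkpoint name to per-task valid MRR), the function names, and the numbers in the usage example are illustrative assumptions, not values from the experiments.

```python
def n_checkpoints(n_trials, n_epochs, save_every):
    """Number of checkpoints saved for one model across all trials."""
    return n_trials * (n_epochs // save_every)


def select_checkpoints(valid_mrr):
    """For each task, pick the checkpoint with the highest valid MRR.
    `valid_mrr` maps checkpoint name -> {task name: valid MRR}."""
    best = {}
    for checkpoint, per_task in valid_mrr.items():
        for task, score in per_task.items():
            if task not in best or score > valid_mrr[best[task]][task]:
                best[task] = checkpoint
    return best


print(n_checkpoints(n_trials=30, n_epochs=100, save_every=20))  # 30 * 5 = 150

# Placeholder valid MRR values (NOT results from the paper).
valid_mrr = {
    "trial0-epoch20": {"KA-Ins": 0.30, "KC-Ins": 0.10},
    "trial0-epoch40": {"KA-Ins": 0.35, "KC-Ins": 0.08},
}
print(select_checkpoints(valid_mrr))
```

As the toy values show, different tasks may end up selecting different checkpoints of the same model.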

A.6 Runtime Environment
All experiments are conducted on a server with the following environment.