GreenKGC: A Lightweight Knowledge Graph Completion Method

Knowledge graph completion (KGC) aims to discover missing relationships between entities in knowledge graphs (KGs). Most prior KGC work focuses on learning embeddings for entities and relations through a simple score function. Yet, a higher-dimensional embedding space is usually required for a better reasoning capability, which leads to larger model size and hinders applicability to real-world problems (e.g., large-scale KGs or mobile/edge computing). A lightweight modularized KGC solution, called GreenKGC, is proposed in this work to address this issue. GreenKGC consists of three modules: representation learning, feature pruning, and decision learning, to extract discriminant KG features and make accurate predictions on missing relationships using classifiers and negative sampling. Experimental results demonstrate that, in low dimensions, GreenKGC can outperform SOTA methods in most datasets. In addition, low-dimensional GreenKGC can achieve competitive or even better performance against high-dimensional models with a much smaller model size.


Introduction
Knowledge graphs (KGs) store human knowledge in a graph-structured format, where nodes and edges denote entities and relations, respectively. The (head entity, relation, tail entity) factual triple, represented by (h, r, t), is the basic component of a KG. In many knowledge-centric artificial intelligence (AI) applications, such as question answering [Huang et al., 2019, Saxena et al., 2020], information extraction [Hoffmann et al., 2011, Daiber et al., 2013], and recommendation [Wang et al., 2019, Xian et al., 2019], KGs play an important role as they provide explainable reasoning paths to predictions. However, most KGs suffer from the incompleteness problem; namely, a large number of factual triples are missing, leading to information loss in downstream applications. Thus, there is growing interest in developing KG completion (KGC) methods that solve the incompleteness problem by inferring undiscovered factual triples from existing ones.
arXiv:2208.09137v1 [cs.AI] 19 Aug 2022
Knowledge graph embedding (KGE) methods have been widely used to solve the incompleteness problem. Embeddings for entities and relations are stored as model parameters and updated by maximizing the scores of observed triples while minimizing those of negative triples. The number of parameters in a KGE model is linear in the embedding dimension and in the number of entities and relations in the KG, i.e., O((|E| + |R|)d), where |E| is the number of entities, |R| is the number of relations, and d is the embedding dimension. Since KGE models usually require a higher-dimensional embedding space for a better reasoning capability, they require large model sizes (i.e., parameter numbers) to achieve satisfactory performance, as demonstrated in Fig. 1. As a result, it is challenging for them to handle large-scale KGs with many entities and relations on resource-constrained platforms such as mobile/edge devices. A KGC method that has good reasoning capability in low dimensions is desired.
Figure 1: The MRR performance versus the number of parameters of three KGE methods on the FB15k-237 dataset (left) and the YAGO3-10 dataset (right). When a model has a smaller size, its performance is poorer. Also, the larger dataset, YAGO3-10, demands more model parameters than the smaller dataset, FB15k-237, to achieve satisfactory results.
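To make the O((|E| + |R|)d) scaling concrete, the following minimal sketch evaluates the parameter count at two embedding dimensions. The function name `kge_param_count` is hypothetical, and the count assumes one real-valued d-dimensional vector per entity and per relation; models with complex-valued embeddings, such as RotatE, would differ by a constant factor. The sizes used are the standard FB15k-237 statistics (14,541 entities, 237 relations).

```python
def kge_param_count(num_entities: int, num_relations: int, dim: int) -> int:
    """Parameters of a KGE model with one real d-dim vector per entity/relation."""
    return (num_entities + num_relations) * dim


# FB15k-237: 14,541 entities and 237 relations.
high = kge_param_count(14_541, 237, 500)  # 500-D embeddings
low = kge_param_count(14_541, 237, 100)   # 100-D embeddings
print(high, low, high / low)
```

Under this assumption the embedding table shrinks in direct proportion to d, so the entity count dominates the model size for large KGs.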
The requirement of high-dimensional embeddings in popular KGE methods is due to over-simplified score functions [Xiao et al., 2015]. Classification-based KGC methods, such as ConvE [Dettmers et al., 2018], aim to increase the reasoning capability in low dimensions by adopting deep neural networks (DNNs) as powerful decoders. They can achieve satisfactory performance in low dimensions, but it is not clear which logical patterns in KGs are well modeled by the embeddings and whether the decoders can be optimized end-to-end. In addition, DNNs demand longer inference time and more computation power than popular KGE methods. Recently, DualDE [Zhu et al., 2022] applied Knowledge Distillation [Hinton et al., 2015] to a high-dimensional KGE so as to train powerful low-dimensional embeddings. Yet, it demands three stages of embedding training: 1) training a high-dimensional KGE, 2) training a low-dimensional KGE with the guidance of the high-dimensional KGE, and 3) multiple rounds of student-teacher interactions. Its training process is time-consuming and unstable, and the instability means that it can easily fail to converge.
Here, we propose a new KGC method that works well in low dimensions and name it GreenKGC. GreenKGC consists of three modules: 1) representation learning, 2) feature pruning, and 3) decision learning. Each of them is trained independently. In Module 1, we leverage a KGE method, called the baseline method, to learn high-dimensional entity and relation representations. In Module 2, a feature pruning process is applied to the high-dimensional entity and relation representations to yield discriminant low-dimensional features for triples. Since logical patterns vary considerably across relations, we partition triples into several relation groups of homogeneous logical patterns. Thus, relations in the same group can share the same low-dimensional features. This helps improve the performance and reduces the model size. In Module 3, we train a binary classifier for each relation group so that it can predict a triple's score in inference. The predicted score is a soft value between 0 and 1, which indicates the probability that a given triple exists.
In addition, we propose two novel negative sampling schemes, embedding-based and ontology-based, for classifier training in this work. They are used for hard negative mining, i.e., to collect negatives that the baseline methods cannot predict correctly. Finally, we conduct extensive experiments and compare the performance and model sizes of GreenKGC and several representative KGC methods on two KGC tasks: link prediction and triple classification. Experimental results show that GreenKGC can achieve impressive performance with a model size that is smaller than that of state-of-the-art benchmarking KGC methods.

KGE Methods
Based on their score functions, KGE models can be categorized into distance-based and semantic-matching-based [Wang et al., 2017]. Distance-based KGE methods model relations as linear transformations from head to tail entities. TransE [Bordes et al., 2013] is one of the distance-based KGE methods. It models relations as translations. Though it can model compositionality among different relations well, it fails to model some logical patterns such as symmetry. RotatE [Sun et al., 2019a] learns embeddings in a complex embedding space and models a relation as a rotation to improve the model's expressiveness. Recent work has tried to model relations as scaling [Chao et al., 2020] and reflection transformations to model particular relation patterns. Score functions in semantic-matching-based KGE methods, such as RESCAL [Lin et al., 2015], calculate the similarities among triples in the embedding space. DistMult [Bordes et al., 2014] adopts diagonal matrices to model relations to reduce model complexity. ComplEx [Trouillon et al., 2016] extends DistMult to a complex space for better model expressiveness, especially for asymmetric relations. TuckER [Balažević et al., 2019b] introduces a core tensor to allow more flexibility in modeling relations. KGE models can capture logical patterns well when the embedding dimension is sufficiently large. Yet, they cannot offer satisfactory performance in a low-dimensional embedding space due to over-simplified score functions. In this work, we focus on applying GreenKGC to distance-based KGE methods, since it is clear which relation patterns they model, which helps achieve good performance in low dimensions.
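The symmetry limitation of translation-based models can be checked numerically. The sketch below is illustrative only: the function names and toy vectors are hypothetical, and both scores use the Euclidean norm as a simplifying assumption (the original models may use other norms or a margin). A non-zero translation that fits (h, r, t) cannot also fit (t, r, h), whereas a rotation by π is its own inverse and fits both directions.

```python
import cmath
import math


def transe_score(h, r, t):
    """TransE-style score: negative Euclidean distance between h + r and t."""
    return -math.dist([hi + ri for hi, ri in zip(h, r)], t)


def rotate_score(h, r, t):
    """RotatE-style score: negative distance between the rotated head h * r and t."""
    return -math.sqrt(sum(abs(hi * ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# TransE: a non-zero translation fits (h, r, t) but not the reversed triple.
h, r = [1.0, 0.0], [0.5, 0.5]
t = [hi + ri for hi, ri in zip(h, r)]
forward, backward = transe_score(h, r, t), transe_score(t, r, h)

# RotatE: a unit-modulus rotation by pi maps h to t and t back to h.
hc = [1 + 2j, -0.5j]
rc = [cmath.exp(1j * math.pi)] * 2
tc = [hi * ri for hi, ri in zip(hc, rc)]
sym_forward, sym_backward = rotate_score(hc, rc, tc), rotate_score(tc, rc, hc)

print(forward, backward, sym_forward, sym_backward)
```

For TransE, both directions can only score perfectly when r is the zero vector, which would collapse h and t onto each other.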

Classification-based KGC Methods
Existing classification-based KGC methods adopt DNNs as powerful decoders for KGC tasks. NTN [Socher et al., 2013] adopts a neural tensor network combined with textual representations of entities. ConvKB [Nguyen et al., 2017] uses 1 × 3 convolutional filters followed by several fully connected (FC) layers to predict triple scores. ConvE [Dettmers et al., 2018] reshapes entity and relation embeddings into 2D images, and uses 3 × 3 convolutional filters followed by several FC layers to predict the scores of triples. Both ConvKB and ConvE retain performance in the low-dimensional space similar to that in the high-dimensional space. Since embeddings in DNNs are optimized end-to-end, the corresponding relation modeling is not transparent. In addition, their inference time is much longer. In contrast, GreenKGC proposed in this work adopts a modularized design, which is more interpretable. Its computational complexity is lower and its inference time is shorter.

Low-dimensional KGE Methods
Recently, research on the design of low-dimensional KGE methods has received attention. MuRP [Balažević et al., 2019a] embeds entities and relations in a hyperbolic space due to its effectiveness to model hierarchies in KGs. AttH [Chami et al., 2020] improves hyperbolic KGE by leveraging hyperbolic isometries to model logical patterns. MulDE [Wang et al., 2021] adopts Knowledge Distillation [Hinton et al., 2015] on a set of hyperbolic KGE as teachers to learn powerful embeddings in low dimensions. Although hyperbolic KGE can achieve good performance in low dimensions, they are not convenient to use in downstream tasks as their geometric properties are not compatible with machine learning models and optimization derived in the Euclidean space. In Euclidean space, DualDE [Zhu et al., 2022] adopts Knowledge Distillation on high-dimensional KGE for smaller model size and faster inference time. Although it performs well in low dimensions, it requires training new low-dimensional embeddings from scratch. GreenKGC has two clear advantages over existing low-dimensional KGE. First, it fully operates in the Euclidean space. Second, it does not need to train new low-dimensional embeddings from scratch.

Methodology
GreenKGC is presented in this section. It consists of three modules: representation learning, feature pruning, and decision learning, to obtain discriminant low-dimensional triple features and predict triple scores accurately. An overview of GreenKGC is given in Fig. 2. Details of each module are elaborated below.

Representation Learning
We leverage existing KGE models to obtain initial embeddings of entities and relations, whose dimensions can be high for greater expressiveness. The initial embedding dimension is then substantially reduced in the feature pruning module. In general, GreenKGC can build upon any existing KGE model. We refer to the KGE models used in GreenKGC as our baseline models. The detailed training procedure for this module is given in the appendix.

Feature Pruning
High-dimensional KG representations are pruned into low-dimensional discriminant KG features in this module.
KG partitioning. Since relations in KGs can have different logical patterns (e.g., symmetric vs. asymmetric), we first partition them into disjoint sets, where relations in each set have similar properties. To achieve this objective, we examine the relation clusters in the embedding space. We expect relations of similar logical patterns to be closer to each other in the embedding space. We visualize embeddings for relations in FB15k-237 using t-SNE [Van der Maaten and Hinton, 2008] in Fig. 6. We observe several relation clusters in the embedding space, which is consistent with our expectation. Thus, we apply k-Means clustering to relation embeddings, where k is the number of clusters. A smaller k results in worse performance, while a larger k demands more parameters in decision learning and may result in overfitting.
Discriminant Feature Test (DFT). DFT is a recently proposed supervised feature selection method. All training samples have a high-dimensional feature set as well as the corresponding labels. DFT scans through each dimension in the feature set and computes its discriminability based on sample labels. DFT can be used to reduce the dimensions of entity and relation embeddings while preserving their power in downstream tasks such as KGC.
Here, we need to extend DFT to the multivariate setting since there are multiple variables in each triple. For example, TransE [Bordes et al., 2013] has 3 variables (i.e. h, r, and t) in each feature dimension. On the other hand, RotatE [Sun et al., 2019a] has 5 variables in each feature dimension as both h and t are embedded in the complex domain with two variables per dimension. Fig. 4 illustrates how DFT selects discriminant features in the KGC context. First, for each dimension, we conduct the principal component analysis (PCA) and map multiple variables to a single variable (i.e., the first principal component) for each triple.

Figure 4: Illustration of applying DFT to KGC tasks. For binary classification, if positives and negatives can be easily separated in a particular dimension, this dimension will be selected as a discriminant one.
Mathematically, we have a collection of concatenated triple variables in a single dimension, $V = [v_1; v_2; \dots; v_{|T|}] \in \mathbb{R}^{|T| \times n_v}$, where each row $v_i \in \mathbb{R}^{n_v}$ is the concatenation of head, relation, and tail variables. $|T|$ is the total number of triples, and $n_v$ is the number of triple variables per dimension, e.g., $n_v = 3$ for TransE and $n_v = 5$ for RotatE. We adopt PCA to reduce the dimension of $V$ to 1-D by solving the singular value decomposition (SVD) in the form of
$$V \approx U \Sigma W^{\top},$$
where $\Sigma$ is the largest singular value in the SVD, $U \in \mathbb{R}^{|T| \times 1}$ is the corresponding left singular vector, and $W \in \mathbb{R}^{n_v \times 1}$ is the learned subspace basis that transforms triple variables $v \in \mathbb{R}^{n_v}$ to scalars for feature pruning. A single-dimensional triple feature $e_i$ can be obtained by $e_i = v_i W$.
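The projection step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are random placeholders, the variable names are hypothetical, and the SVD is applied to V directly as described in the text (a standard PCA would usually center the columns first).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: |T| = 200 triples, n_v = 3 variables (h, r, t) in one dimension.
num_triples, n_v = 200, 3
V = rng.normal(size=(num_triples, n_v))

# Thin SVD: V = U @ diag(S) @ Wt, with singular values in descending order.
U, S, Wt = np.linalg.svd(V, full_matrices=False)

# W is the first right singular vector, shaped (n_v, 1) as in the text.
W = Wt[0].reshape(n_v, 1)

# One scalar feature e_i = v_i W per triple.
e = V @ W
print(e.shape, S[0])
```

Because the right singular vectors are orthonormal, the projected column equals the first left singular vector scaled by the largest singular value, so this single scalar per triple retains the direction of largest variation among the n_v variables.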
Afterward, we can apply the standard DFT to each dimension since each dimension has a scalar value after the above PCA step. DFT adopts cross-entropy to evaluate the discriminant power of each dimension, as cross-entropy is a typical loss for binary classification. Dimensions with lower cross-entropy have higher discriminant power. We retain the feature dimensions with the lowest DFT errors and prune the rest to obtain low-dimensional features.
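A simplified sketch of this selection step is given below. It assumes that the DFT error of a dimension is the minimum, over a set of candidate thresholds, of the sample-weighted binary entropy of the left and right partitions. The evenly spaced threshold search, the parameter `n_candidates`, and all function names are assumptions made for illustration.

```python
import math


def binary_entropy(p1):
    """Entropy -p0*log(p0) - p1*log(p1), with 0*log(0) taken as 0."""
    return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0.0)


def split_cross_entropy(values, labels, threshold):
    """Sample-weighted entropy of the partitions left and right of a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return float("inf")  # degenerate split, never preferred
    h_left = binary_entropy(sum(left) / len(left))
    h_right = binary_entropy(sum(right) / len(right))
    return (len(left) * h_left + len(right) * h_right) / len(values)


def dft_error(values, labels, n_candidates=16):
    """Lowest split cross-entropy over evenly spaced candidate thresholds."""
    lo, hi = min(values), max(values)
    thresholds = [lo + (hi - lo) * k / (n_candidates + 1)
                  for k in range(1, n_candidates + 1)]
    return min(split_cross_entropy(values, labels, th) for th in thresholds)


def select_dimensions(feature_columns, labels, keep):
    """Indices of the `keep` dimensions with the lowest DFT error."""
    errors = [dft_error(col, labels) for col in feature_columns]
    return sorted(range(len(errors)), key=errors.__getitem__)[:keep]


labels = [0, 0, 0, 1, 1, 1]                  # 1 = observed triple, 0 = negative
separable = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]   # positives and negatives split cleanly
noisy = [0.1, 0.8, 0.3, 0.2, 0.9, 1.0]       # classes overlap along this dimension
kept = select_dimensions([noisy, separable], labels, keep=1)
print(dft_error(separable, labels), dft_error(noisy, labels), kept)
```

In this toy case the cleanly separable dimension has zero error and is the one retained.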

Decision Learning
We formulate KGC tasks as a binary classification problem in each relation group. We adopt binary classifiers as decoders since they are more powerful than distance-based score functions. The binary classifiers take pruned triple features as the inputs and predict the soft probability (between 0 and 1) of the triple as the output. We also conduct classifier training with negative sampling so as to train a powerful classifier.
Binary classification. The binary classifiers take a low-dimensional triple feature e i and predict a soft label ŷ i . The label y i = 1 for the observed triples and y i = 0 for the sampled negatives. We train a binary classifier g(·), which returns a soft decision between 0 and 1, by minimizing the following negative log-likelihood loss
$$\ell(y_i, \hat{y}_i) = -y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i),$$
where ŷ i = g(e i ) is the final triple score. In general, we select a nonlinear classifier to accommodate nonlinearity in sample distributions.
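As a small worked example of the negative log-likelihood for binary labels, the sketch below evaluates it for confident and uninformative predictions. The helper names and the clipping constant `eps` are assumptions added for numerical safety and are not part of the method description.

```python
import math


def nll_loss(y: int, y_hat: float, eps: float = 1e-12) -> float:
    """Negative log-likelihood of label y in {0, 1} under predicted probability y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def mean_nll(labels, predictions):
    """Average loss over a batch of (label, prediction) pairs."""
    return sum(nll_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)


labels = [1, 1, 0, 0]  # observed triples followed by sampled negatives
confident = mean_nll(labels, [0.9, 0.8, 0.1, 0.2])
uncertain = mean_nll(labels, [0.5, 0.5, 0.5, 0.5])
print(confident, uncertain)
```

A classifier that outputs 0.5 for every triple incurs a loss of log 2 per sample, which serves as a natural reference point when monitoring training.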
Negative sampling. It is desirable to mine the hard negative cases of the original KGE models. We propose two negative sampling schemes to train a more powerful classifier. First, most KGE models can only capture coarse entity type information. For example, given the query (Mary, born_in, ?), they may predict a location but not the exact answer. Thus, we draw negative samples within the entity types constrained by the relation [Krompaß et al., 2015] to enhance the capability to predict the exact answer. Such a negative sampling scheme is called ontology-based negative sampling. We also investigate the sampling of hard negatives that cannot be trivially obtained from the original KGE methods. Negatives with higher embedding scores f r (h i , t i ) tend to be predicted wrongly by the KGE methods. To handle this, we rank all randomly sampled negative triples and select the ones with higher embedding scores as hard negatives for classifier training. Such a negative sampling scheme is called embedding-based negative sampling.
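The two schemes can be sketched on toy data as follows. Everything here is hypothetical: the entity names, embeddings, type labels, and function names are placeholders, a TransE-style score stands in for the baseline f r, and filtering of corrupted triples that happen to be true is omitted for brevity.

```python
import math
import random


def transe_score(h, r, t):
    """Stand-in KGE score f_r(h, t): negative distance between h + r and t."""
    return -math.dist([hi + ri for hi, ri in zip(h, r)], t)


def ontology_negatives(true_tail, allowed_type, entity_types, n, rng):
    """Corrupt the tail only with entities of the type allowed by the relation."""
    pool = [e for e, ty in entity_types.items()
            if ty == allowed_type and e != true_tail]
    return rng.sample(pool, min(n, len(pool)))


def embedding_negatives(h, r, true_tail, emb, n_random, n_hard, rng):
    """Sample random corrupted tails, then keep those the KGE scores highest."""
    pool = [e for e in emb if e != true_tail]
    sampled = rng.sample(pool, min(n_random, len(pool)))
    sampled.sort(key=lambda e: transe_score(emb[h], r, emb[e]), reverse=True)
    return sampled[:n_hard]


# Toy data: entity embeddings, entity types, and one relation vector.
emb = {"Mary": [0.0, 0.0], "Paris": [1.0, 0.0], "Lyon": [1.1, 0.1],
       "Tokyo": [3.0, 2.0], "Bob": [-2.0, 1.0]}
entity_types = {"Mary": "person", "Bob": "person",
                "Paris": "location", "Lyon": "location", "Tokyo": "location"}
born_in = [1.0, 0.0]
rng = random.Random(0)

onto = ontology_negatives("Paris", "location", entity_types, n=2, rng=rng)
hard = embedding_negatives("Mary", born_in, "Paris", emb,
                           n_random=4, n_hard=1, rng=rng)
print(onto, hard)
```

In this toy setup the hardest negative for (Mary, born_in, Paris) is the nearby location Lyon, which is exactly the kind of near miss the classifier needs to learn to reject.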

Experiments
We evaluate GreenKGC on two KGC tasks: link prediction and triple classification. For link prediction, we would like to answer the following questions:
• Can GreenKGC outperform SOTA KGC methods in lower dimensions?
• Can GreenKGC reduce the number of parameters compared with the original KGE methods while retaining comparable or better performance?
• Can GreenKGC outperform other classification-based methods with fewer parameters and shorter inference time?
• Is the proposed feature pruning scheme effective?
• Are the two proposed negative sampling methods for classifier training effective, and which types of KGs are they suitable for?
Also, we would like to demonstrate that GreenKGC can be generalized to more tasks such as triple classification.

Experimental Setup
Datasets. We consider four widely used KG datasets for performance benchmarking: FB15k-237 [Bordes et al., 2013, Toutanova and Chen, 2015], WN18RR [Bordes et al., 2013, Dettmers et al., 2018], YAGO3-10 [Dettmers et al., 2018], and CoDEx [Safavi and Koutra, 2020]. Their statistics are summarized in Table 1. FB15k-237 is a subset of Freebase [Bollacker et al., 2008] that contains real-world relationships. WN18RR is a subset of WordNet [Miller, 1995] containing lexical relationships between word senses. YAGO3-10 is a subset of YAGO3 [Mahdisoltani et al., 2014] that describes the attributes of persons. CoDEx is extracted from Wikidata [Vrandečić and Krötzsch, 2014]. It is further divided into three subsets based on size and relation patterns: CoDEx-S, CoDEx-M, and CoDEx-L. CoDEx-S and CoDEx-M are used for triple classification while CoDEx-L is used for link prediction. CoDEx-S and CoDEx-M contain curated hard negatives for triple classification, which evaluate a model's capability of identifying hard negatives.
Table 3: Results on the link prediction task, where we show the performance gain (or loss) in terms of percentages with an up (or down) arrow and the ratio of the model size within the parentheses against those of respective 500-D models.
Implementation details. We select TransE [Bordes et al., 2013] and RotatE [Sun et al., 2019a] as the baseline KGE methods to learn initial representations for entities and relations. We train 500-D TransE and RotatE using publicly available code. We consider several tree-based binary classifiers, including Decision Trees [Breiman et al., 2017], Random Forest [Breiman, 2001], and Gradient Boosting Machines [Chen and Guestrin, 2016], and adopt the one that achieves the best results on the validation set. We also search for the best hyper-parameters based on the validation set. They are: tree depth d ∈ {3, 5, 7}, tree number n ∈ {100, 300, 500, 700, 1000}, and learning rate lr ∈ {0.05, 0.1, 0.3}. Results for other methods in low dimensions are either taken from previous low-dimensional KGE papers [Chami et al., 2020, Zhu et al., 2022] or, if not presented there, obtained by training publicly available implementations ourselves with the hyper-parameter search suggested by the original paper. Triple classification results for benchmarking methods are taken from Safavi and Koutra [2020], and the numbers of parameters are calculated based on the best configuration in that paper.
Evaluation metrics. For the link prediction task, the goal is to predict the missing entity given a query triple, i.e., (h, r, ?) or (?, r, t). The correct entity should be ranked higher than other candidates. Here, several common ranking metrics are used, such as MRR (Mean Reciprocal Rank) and Hits@k (k = 1, 3, 10). Following the convention in Bordes et al. [2013], we adopt the filtered setting, where all entities serve as candidates except for the ones that form triples already seen in the training, validation, or testing sets. For the triple classification task, the goal is to predict the plausibility (i.e., 0 or 1) of a query triple (h, r, t). As in prior work, we find the optimal score threshold for each relation using the validation set, apply it to the testing set, and use accuracy and the F1 score to evaluate the results.
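The ranking metrics can be illustrated with a short sketch. The function names and toy scores are hypothetical, and ties are broken optimistically (only strictly higher scores count against the target), which is one of several conventions in use.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, ignoring other known-true answers."""
    target_score = scores[target]
    better = sum(1 for e, s in scores.items()
                 if e != target and e not in known_true and s > target_score)
    return better + 1


def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy query (h, r, ?): candidate scores, where "b" is another known-true tail.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw = filtered_rank(scores, "c", known_true=set())
filtered = filtered_rank(scores, "c", known_true={"b"})

ranks = [1, filtered, 4]  # ranks collected over three hypothetical queries
print(raw, filtered, mrr(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 3))
```

Filtering moves the target from rank 3 to rank 2 here, because the other valid answer is not allowed to count against it.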

Link Prediction
Results in low dimensions. In Table 2, we compare GreenKGC with KGE, classification-based, and low-dimensional KGE methods in low dimensions. KGE methods cannot achieve good performance due to over-simplified score functions. Classification-based methods achieve performance better than KGE methods as they adopt DNNs as complex decoders. Low-dimensional KGE methods provide state-of-the-art KGC solutions in low dimensions. Yet, GreenKGC outperforms them in FB15k-237 and YAGO3-10 in all metrics. For WN18RR, the baseline KGE methods perform poorly. Since GreenKGC relies on KGE as its input, this affects the performance of GreenKGC. However, GreenKGC still outperforms all KGE and classification-based methods in WN18RR.
We show the performance curves of various methods as a function of embedding dimension in Fig. 5. We see that the performance of KGE methods (i.e., TransE and RotatE) drops significantly as the embedding dimension decreases. Although classification-based methods (i.e., ConvE and ConvKB) are less influenced by lower dimensions, their performance still degrades when the dimension is lower than 64-D. GreenKGC and AttH are the two methods that can offer reasonable performance as the dimension goes as low as 8-D.
Comparison with baseline KGE. One unique characteristic of GreenKGC is that it prunes a high-dimensional KGE into low-dimensional triple features and makes predictions with a binary classifier as a powerful decoder. In Table 3, we evaluate the capability of GreenKGC to reduce the number of parameters while maintaining performance by pruning the original 500-D KGE to 100-D triple features. As shown in the table, GreenKGC can achieve comparable or even better performance with a model size around 5 times smaller. In particular, Hits@1 is retained the most and is even improved compared to the high-dimensional baselines. This suggests that our negative sampling for classifier training can correct some failure cases of the baseline methods. In addition, GreenKGC using TransE as the baseline can outperform high-dimensional TransE on all datasets. Since the TransE score function is simple and fails to model some relation patterns, such as symmetry and 1-to-N, combining TransE with a powerful decoder, i.e., a binary classifier, in GreenKGC successfully overcomes deficiencies of the over-simplified score function. For all datasets, GreenKGC-100 generates better results than 100-D KGE models.
Comparison with classification-based methods. We benchmark GreenKGC against two other classification-based methods, ConvKB [Nguyen et al., 2017] and ConvE [Dettmers et al., 2018], in Table 4 in terms of performance, model size, and inference time. GreenKGC outperforms both in almost all cases. Compared with ConvKB, GreenKGC achieves significantly better performance with slightly more parameters. Compared with ConvE, which adopts a deep architecture, GreenKGC uses fewer parameters, requires shorter inference time, and offers better performance.
Ablation study. We evaluate the effectiveness of the feature pruning scheme in GreenKGC in Table 5. We use "w/o pruning" to denote the original 32-D KGE directly followed by the decision learning module. Also, we compare three feature pruning schemes: 1) random selection, 2) selecting dimensions with the highest DFT errors (i.e., the least discriminant ones), and 3) selecting dimensions with the lowest DFT errors (i.e., the most discriminant ones). As shown in the table, selecting the most discriminant features achieves the best results and selecting the least discriminant features achieves the worst results on both datasets. Furthermore, the proposed feature pruning scheme is effective since it is significantly better than the variant without pruning.
Table 7: Results on triple classification datasets.
We also evaluate the effectiveness of the two proposed negative sampling methods (i.e., ontology- and embedding-based) in Table 6. On FB15k-237, both are more effective than randomly drawn negative samples, and the ontology-based one gives better results than the embedding-based one. On WN18RR, the embedding-based one achieves the best results. Since there is no clear definition of "types" in WordNet, the ontology-based one performs worse than the randomly drawn one. We conclude that, to correct failure cases in the baseline KGE, ontology-based negative sampling is effective for instance KGs such as FB15k-237, while embedding-based negative sampling is powerful for concept KGs such as WN18RR.

Triple Classification
Results on triple classification are shown in Table 7. We reduce the original 512-D KGE to 128-D triple features in GreenKGC. Again, we see that GreenKGC is able to achieve comparable or even better performance with far fewer parameters. It is worth emphasizing that, since the number of parameters in the classifier is invariant to the size of the dataset, GreenKGC saves more parameters on larger datasets (e.g., CoDEx-M) than on smaller datasets (e.g., CoDEx-S). In addition, GreenKGC is able to outperform other methods on CoDEx-M, where composition and symmetry are the two most prevalent relation patterns [Safavi and Koutra, 2020], with a smaller model size.

Conclusion and Future Work
A lightweight KGC method, called GreenKGC, was proposed in this work. It consists of three modules: representation learning, feature pruning, and decision learning. GreenKGC has several advantages over existing KGC methods. First, it only requires a low-dimensional representation space (e.g. 8-D) to achieve satisfactory performance. Thus, its model size is very small. Second, as compared with other classification-based methods, GreenKGC requires shorter inference time and provides better performance. Third, as compared with knowledge-distillation-based methods, the feature pruning process of GreenKGC does not train new embeddings but extracts the most discriminant features in the original KGE. The latter is significantly simpler.
In the future, we plan to extend the green pipeline to more KG applications such as entity typing and entity alignment [Sun et al., 2018]. Note that the feature pruning process is supervised, so it can select a different subset of features for different tasks and labels. Besides, it would be interesting to explore more negative sampling ideas for different applications.

Training Procedure for High-dimensional KGE Models
To train the high-dimensional KGE model as the initial entity and relation representations, we adopt the self-adversarial learning process proposed in Sun et al. [2019a]. That is, given an observed triple (h, r, t) and the KGE model f r (h, t), we minimize the following loss function
$$\mathcal{L} = -\log \sigma\big(f_r(h, t)\big) - \sum_{i=1}^{n} p(h_i, r, t_i) \log \sigma\big(-f_r(h_i, t_i)\big),$$
where σ is the sigmoid function, (h i , r, t i ) is the i-th of n negative samples, and
$$p(h_j, r, t_j) = \frac{\exp\big(\alpha f_r(h_j, t_j)\big)}{\sum_{i=1}^{n} \exp\big(\alpha f_r(h_i, t_i)\big)},$$
where α is the temperature to control the self-adversarial negative sampling. We summarize the score functions for some common KGE models and their corresponding number of variables per dimension n v in Table 8. In general, GreenKGC can build upon any existing KGE models.
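The role of the temperature α can be seen in a small numerical sketch of self-adversarial weighting in the style of Sun et al. [2019a]. The sketch assumes a score f r where higher means more plausible, omits any margin term, and uses hypothetical function names and toy scores; in actual training the weights are treated as constants when computing gradients.

```python
import math


def self_adversarial_weights(neg_scores, alpha):
    """Softmax over negative-sample scores with temperature alpha."""
    m = max(alpha * s for s in neg_scores)  # subtract max for numerical stability
    exps = [math.exp(alpha * s - m) for s in neg_scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def self_adversarial_loss(pos_score, neg_scores, alpha):
    """Loss for one observed triple and its weighted negative samples."""
    weights = self_adversarial_weights(neg_scores, alpha)
    neg_term = sum(w * log_sigmoid(-s) for w, s in zip(weights, neg_scores))
    return -log_sigmoid(pos_score) - neg_term


neg_scores = [-3.0, -1.0, -0.2]  # hypothetical scores of three negatives
uniform = self_adversarial_weights(neg_scores, alpha=0.0)
sharp = self_adversarial_weights(neg_scores, alpha=1.0)
loss = self_adversarial_loss(pos_score=-0.1, neg_scores=neg_scores, alpha=1.0)
print(uniform, sharp, loss)
```

With α = 0 all negatives are weighted equally, while larger α concentrates the loss on the negatives that the current model scores highest.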
In Table 8, • denotes the Hadamard product, ⟨·, ·, ·⟩ is the generalized dot product, and n v is the number of triple variables in one dimension.

KG Partitioning
To verify the idea of relation clusters in the embedding space for KG partitioning, we show the t-SNE visualization of relation embeddings in CoDEx-L and YAGO3-10 in Fig. 6 (a) and (b), respectively. Relations within the same cluster are assigned the same color. We do observe the clustering structure in the t-SNE plot. We list the k-Means results on WN18RR relation embeddings in Table 9. Without explicitly categorizing relations into different logical patterns, relations of similar patterns are still clustered together in the relation embedding space. For example, most relations in cluster #0 are symmetric ones. All relations in cluster #1 are N-to-1. The remaining two relations in cluster #2 are those with the highest tail-per-head ratio.
We evaluate model performance on WN18RR as a function of the cluster number k in Fig. 7. We see that, when k is too small or too large, the performance is not good. When k is too small, relations with different patterns are clustered together. Then, the DFT algorithm fails to select feature dimensions that suit all relations in the cluster. When k is too large, the number of model parameters is larger since one classifier is assigned to each relation group. In addition, some relation groups may contain few relations, and the classifier may overfit. In general, we adopt k = 3 for all datasets as it gives reasonable performance with a small model size.

DFT Implementation Details
To calculate the discriminant power of each dimension, we iterate through each dimension in the high-dimensional feature set and calculate the discriminant power based on sample labels. More specifically, we model KGC as a binary classification task. We assign label y i = 1 to the ith sample if it is an observed triple and y i = 0 if it is a negative sample. For the dth dimension, we split the 1-D feature space into left and right subspaces and calculate the cross-entropy
$$H^{(d)} = \frac{N_L H_L^{(d)} + N_R H_R^{(d)}}{N_L + N_R},$$
where N L and N R are the numbers of samples in the left and right intervals, respectively, and
$$H_L^{(d)} = -P_{L,0} \log P_{L,0} - P_{L,1} \log P_{L,1},$$
where $P_{L,1} = \frac{1}{N_L} \sum_{i=1}^{N_L} y_i$ and $P_{L,0} = 1 - P_{L,1}$, and similarly for $H_R^{(d)}$, $P_{R,1}$, and $P_{R,0}$. A lower cross-entropy value implies higher discriminant power.
The cross-entropy value is plotted as a curve of the rank-ordered dimension (from the lowest to the highest) in Fig. 8 when pruning 512-D RotatE on FB15k-237. As shown in the figure, the slope of the curve changes drastically and a "shoulder point" occurs at around d = 100. In general, we can get good performance in low dimensions as long as we preserve the dimensions ranked before the shoulder point of the curve and prune all other dimensions. Fig. 9 shows histograms of PCA-transformed 1-D triple variables in two different feature dimensions. As seen in the figure, samples in Fig. 9 (a), i.e., the feature dimension with the lower cross-entropy, are more separable than those in Fig. 9 (b), i.e., the feature dimension with the higher cross-entropy. Therefore, a lower cross-entropy implies a more discriminant feature dimension.
Figure 11: Prediction distribution of a query (38th Grammy Awards, award_winner, ?) in FB15k-237. A higher predicted score implies a higher chance to be a valid triple.

Best Configurations
RotatE can model many relation patterns, and XGBoost was the most powerful classifier among those we considered. Thus, they are adopted for all datasets in our experiments. With RotatE as the baseline KGE and XGBoost as the classifier, the best configuration for each dataset is listed in Table 10. The embedding-based negative sampling is effective for WN18RR while the ontology-based negative sampling is effective for FB15k-237. For the other datasets, the two schemes achieve similar performance, so we adopt embedding-based negative sampling in the implementation as it is more efficient. Baseline KGEs are trained on a single NVIDIA Tesla V100 GPU with 16 GB of RAM, while binary classifiers are trained using AMD EPYC 7542 CPUs.

Triple Classification Results in Three Representative Relations
We compare predictions from GreenKGC and KGE methods with scatter plots in Fig. 10, where the vertical axis shows the scores predicted by GreenKGC and the horizontal axis shows the scores from KGE. As shown in the figure, many samples have KGE predictions lying between 0.2 and 0.6. The overlap of positive and negative samples in that interval makes the binary classification task more challenging. In contrast, predictions from GreenKGC are closer to either 0 or 1. Thus, it is easier for GreenKGC to differentiate positive samples from negative samples. This is especially true for symmetric relations such as spouse and sibling. These results support our methodology in classification-based link prediction, where Hits@1 can be improved significantly.

Prediction Distribution
It was reported in Sun et al. [2019b] that the predicted scores for all candidates on FB15k-237 converge to 1 with ConvKB [Nguyen et al., 2017]. This is unlikely to be correct given that KGs are often highly sparse. The issue was resolved after ConvKB was re-implemented in PyTorch, but the performance on FB15k-237 is still not as good as originally reported in the ConvKB paper. The issue illustrates a problem of end-to-end optimization: it is difficult to control and monitor every component in the model. This urges us to examine whether GreenKGC has the same issue. Fig. 11 shows the sorted predicted scores of a query (38th Grammy Awards, award_winner, ?) in FB15k-237. We see from the figure that only very few candidates have scores close to 1, while the other candidates receive scores close to 0. The former are valid triples. The score distribution is consistent with the sparse nature of KGs.