To Copy Rather Than Memorize: A Vertical Learning Paradigm for Knowledge Graph Completion

Embedding models have shown great power in the knowledge graph completion (KGC) task. By learning structural constraints for each training triple, these methods implicitly memorize intrinsic relation rules to infer missing links. However, this paper points out that multi-hop relation rules are hard to reliably memorize due to the inherent deficiencies of this implicit memorization strategy, making embedding models underperform in predicting links between distant entity pairs. To alleviate this problem, we present the Vertical Learning Paradigm (VLP), which extends embedding models by allowing them to explicitly copy target information from related factual triples for more accurate prediction. Rather than relying solely on implicit memory, VLP directly provides additional cues to improve the generalization ability of embedding models, in particular making distant link prediction significantly easier. Moreover, we also propose a novel relative-distance-based negative sampling technique (ReD) for more effective optimization. Experiments demonstrate the validity and generality of our proposals on two standard benchmarks. Our code is available at https://github.com/rui9812/VLP.


Introduction
Knowledge graphs (KGs) structurally represent human knowledge as a collection of factual triples. Each triple (h, r, t) represents that there is a relation r between head entity h and tail entity t. With their massive stores of human knowledge, KGs facilitate a myriad of downstream applications (Xiong et al., 2017). However, real-world KGs such as Freebase (Bollacker et al., 2008) are far from complete (Bordes et al., 2013). This motivates substantial research on the knowledge graph completion (KGC) task, i.e., automatically inferring missing triples. As an effective solution for KGC, embedding models learn representations of entities and relations with pre-designed relation operations. For example, TransE (Bordes et al., 2013) represents relations as translations between head and tail entities. RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) model the three-way interactions in each triple. RotatE (Sun et al., 2019), QuatE (Zhang et al., 2019) and DualE (Cao et al., 2021) represent relations as rotations in different dimensions. Rot-Pro (Song et al., 2021) further introduces an orthogonal projection for each relation.
Essentially, embedding models learn structural constraints for every factual triple during training. For example, for each training triple (h, r, t), TransE constrains the head embedding h plus the relation embedding r to equal the tail embedding t. Such single-triple constraints empower embedding models to implicitly perceive (i.e., memorize) high-order entity connections and intrinsic relation rules (Sun et al., 2019). As shown in Figure 1, by imposing the structural constraints (e.g., h + r = t in TransE) on the five training triples, embedding models can memorize the entity connection (x, r_1 ∧ r_2, z) and the relation rule r_1 ∧ r_2 → r. In this way, the missing link (x, r, z) can be inferred at test time without any explicit prompt. We refer to this single-triple learning paradigm as the Horizontal Learning Paradigm (HLP), since the relation rules are implicitly induced by the horizontal paths between head and tail entities.

Table 1: Popular embedding models as instances of the generalized score function g(W_{r,1}h + b_r, W_{r,2}t).

Model | Score Function | W_{r,1} | b_r | W_{r,2} | g(q, k) | Space
RESCAL (Nickel et al., 2011) | h^⊤ W_r t | I | 0 | W_r | q^⊤ k | R
TransE (Bordes et al., 2013) | −∥h + r − t∥ | I | r | I | −∥q − k∥ | R
TransR (Lin et al., 2015) | −∥W_r h + r − W_r t∥ | W_r | r | W_r | −∥q − k∥ | R
DistMult (Yang et al., 2015) | h^⊤ diag(r) t | diag(r) | 0 | I | q^⊤ k | R
ComplEx (Trouillon et al., 2016) | Re(h^⊤ diag(r) t̄) | diag(r) | 0 | I | Re(q^⊤ k̄) | C
RotatE (Sun et al., 2019) | −∥h • r − t∥ | diag(r) | 0 | I | −∥q − k∥ | C

However, this paper shows that HLP-based embedding models struggle to reliably memorize multi-hop relation rules, owing to inevitable single-triple bias and high-demanding memory capacity. The unreliable multi-hop relation rules in the implicit memory cannot serve as a rational basis for prediction, leading to the inferior performance of embedding models in predicting links between distant entity pairs. This raises a question: is there a general paradigm for embedding models that alleviates this problem of HLP and achieves superior performance?
We give an affirmative answer by presenting the Vertical Learning Paradigm (VLP), which endows embedding models with the ability to explicitly consult related factual triples (i.e., vertical references) for more accurate prediction. Specifically, to answer (h, r, ?), VLP first selects N relevant reference queries in the training graph, and then treats their ground-truth entities as reference answers for embedding models to jointly predict the target t. This learning process can be viewed as an explicit copy strategy, in contrast to the implicit memorization strategy of HLP, making it significantly easier to predict distant links. Moreover, to optimize the models effectively, we further propose a novel Relative Distance based negative sampling technique (ReD), which can generate more informative negative samples and reduce the toxicity of false negative samples. Note that VLP and ReD are both general techniques and can be widely applied to various embedding models. Our contributions are summarized as follows:
• We show that existing embedding models underperform in predicting links between distant entity pairs, since they struggle to reliably memorize multi-hop relation rules.
• We present a novel learning paradigm named VLP, which can empower embedding models to leverage explicit references as cues for more accurate prediction.
• We further propose a new relative distance based negative sampling technique named ReD for more effective optimization.
• We conduct in-depth experiments on two standard benchmarks, demonstrating the validity and generality of the proposed techniques.

Preliminaries
To elicit our proposal from a general paradigm perspective, we give a bird's eye view of existing embedding models in this section. We first review the problem setup of KGC task. Afterwards, we summarize a generalized score function of embedding models and describe how the models learn to predict new links (i.e., horizontal learning paradigm).

Problem Setup
Given the entity set E and relation set R, a knowledge graph can be formally defined as a collection of factual triples D = {(h, r, t)}, in which head/tail entities h, t ∈ E and relation r ∈ R. The KGC task aims to infer new links by answering a query (h, r, ?) or (?, r, t). As an effective tool for this task, embedding models learn representations of entities and relations to measure each candidate's plausibility with a pre-designed score function.
Generalized Score Function
Given a query (h, r, ?) and a candidate answer t, the generalized score function (GSF) first maps the head embedding h ∈ X^{d_e} to the query embedding q ∈ X^{d_r} with a relation-specific linear transformation:

q = W_{r,1}h + b_r, (1)

where X ∈ {R, C} is the embedding space, d_e and d_r are the embedding dimensions of entities and relations, and W_{r,1} ∈ X^{d_r×d_e} and b_r ∈ X^{d_r} denote the relation-specific projection matrix and bias vector. Then, GSF uses another linear function to generate the answer embedding k ∈ X^{d_r} from the tail embedding t ∈ X^{d_e}:

k = W_{r,2}t, (2)

where W_{r,2} ∈ X^{d_r×d_e} denotes the relation transformation matrix for tail projection. Finally, the plausibility score of the triple (h, r, t) is calculated by a similarity function g:

score = g(q, k). (3)

By combining the above three steps, we formally define the generalized score function f_g as follows:

f_g(h, r, t) = g(W_{r,1}h + b_r, W_{r,2}t). (4)

With different choices of W_{r,1}, b_r, W_{r,2} and g, GSF can be instantiated as the specific score functions of existing models. Table 1 exhibits several popular methods and their corresponding GSF settings.
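The generalized score function can be sketched in a few lines of numpy. The helper `gsf` below is an illustrative assumption (not the authors' implementation); it instantiates the TransE and DistMult rows of Table 1 by plugging in the corresponding choices of W_{r,1}, b_r, W_{r,2} and g.

```python
import numpy as np

def gsf(h, t, W1, b, W2, g):
    """Generalized score function: g(W1 @ h + b, W2 @ t)."""
    q = W1 @ h + b   # query embedding
    k = W2 @ t       # answer embedding
    return g(q, k)   # similarity score

d = 4
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, d))
I = np.eye(d)

# TransE: W1 = I, b = r, W2 = I, g(q, k) = -||q - k||
transe = gsf(h, t, I, r, I, lambda q, k: -np.linalg.norm(q - k))

# DistMult: W1 = diag(r), b = 0, W2 = I, g(q, k) = q . k
distmult = gsf(h, t, np.diag(r), np.zeros(d), I, lambda q, k: q @ k)
```

Both calls reduce, term by term, to the familiar score functions −∥h + r − t∥ and h^⊤ diag(r) t.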

Horizontal Learning Paradigm
With the pre-defined score functions, embedding models commonly follow the horizontal learning paradigm, which constructs the single-edge constraints to implicitly memorize high-order entity connections and intrinsic relation rules.
Take RotatE processing the triples in Figure 1 as an example. By imposing the rotation constraint h • r = t on the three triples (a, r_1, b), (b, r_2, c) and (a, r, c), RotatE is able to perceive a two-hop entity connection and further induce a two-hop relation rule:

a • r_1 • r_2 = c = a • r ⟹ r_1 • r_2 = r. (5)

Similarly, the high-order connection can also be captured by constraining (x, r_1, y) and (y, r_2, z):

x • r_1 • r_2 = z. (6)

Finally, by combining Equations (5) and (6), RotatE is capable of inferring the missing link (x, r, z), since x • r = x • r_1 • r_2 = z.
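This rule-composition argument can be checked numerically. The sketch below (with an illustrative helper `random_rotation`) builds RotatE-style relations as unit-modulus complex vectors and verifies that the composed rule transports x exactly onto z.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def random_rotation(d):
    # a RotatE relation: unit-modulus complex vector (element-wise rotation)
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, d))

r1, r2 = random_rotation(d), random_rotation(d)
r = r1 * r2                         # the composed rule r1 ∧ r2 → r

x = rng.normal(size=d) + 1j * rng.normal(size=d)
y = x * r1                          # constraint for (x, r1, y)
z = y * r2                          # constraint for (y, r2, z)
holds = np.allclose(x * r, z)       # the missing link (x, r, z) holds exactly
```

With exact (bias-free) constraints the inferred link holds to machine precision; the next section shows what happens once each edge carries a small bias.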

Motivation
The motive of our work originates from an observation that embedding models underperform in predicting links between distant entity pairs (refer to Appendix A for more details). Since the effectiveness of embedding models is largely determined by their ability to learn intrinsic relation rules (Sun et al., 2019; Song et al., 2021), such inferior performance reveals that the models struggle to memorize multi-hop relation rules. We attribute this deficiency to multi-hop bias accumulation and high-demanding memory capacity in the implicit memorization strategy of HLP.

Multi-hop Bias Accumulation
The HLP-based embedding models implicitly perceive multi-hop relation rules by constraining each training edge, as shown in Section 2.3. Nevertheless, the single-edge constraints inevitably carry biases during optimization, and these biases accumulate as the number of relation hops increases. This bias accumulation makes the memorized relation rules unreliable, leading to deficient generalization in link prediction between distant entities. Concretely, denoting the single-edge biases on (a, r, c), (a, r_1, b) and (b, r_2, c) by ϵ_0, ϵ_1 and ϵ_2, the rule learning process in Equation (5) can be rewritten as:

r_1 • r_2 = r • ϵ_abc, (7)

where ϵ_abc = ϵ_0^{−1} • ϵ_1 • ϵ_2 is the cumulative bias. Note that ϵ_abc is triple-dependent, which makes it intractable for other queries, e.g., (x, r, ?) in Figure 1, to rely on this rule for prediction.
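A small simulation illustrates the accumulation effect. The bias scale `eps`, the dimension, and the hop range below are illustrative assumptions; the sketch composes rotations whose phases each carry a small bias and measures how far the composed rule drifts from the exact one.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 256, 0.05   # embedding dim and per-edge phase-bias scale (illustrative)

errors = []
for n_hops in range(1, 6):
    true_phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_hops, d))
    biases = eps * rng.normal(size=(n_hops, d))
    # exact composed rule vs. the rule composed from biased single-edge rotations
    r_true = np.exp(1j * true_phases.sum(axis=0))
    r_learned = np.exp(1j * (true_phases + biases).sum(axis=0))
    errors.append(np.linalg.norm(r_learned - r_true))
# errors grows with the number of composed hops: longer rules are less reliable
```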
High-demanding Memory Capacity The HLP-based models essentially learn general rules from the relation paths between head and tail entities. As the path length increases, the number of distinct paths (or rules) expands exponentially. This requires intensive memory to store all the crucial relation rules. However, the modeling capacity of embedding models is insufficient to meet this requirement. Since these models constrain basic edges to form long-range paths following the bottom-up design of HLP, they are more inclined to memorize low-order rules and forget high-order rules.
Design Goal We seek to develop a general technique to alleviate the "Hard to Memorize" problem of existing embedding models.

Figure 2: Vertical learning paradigm, consisting of reference query selection, reference graph construction and reference answer aggregation.
A straightforward strategy is to directly extract and process the enclosing subgraph between head and tail entities (Teru et al., 2020), which avoids multi-hop bias accumulation. However, this sophisticated procedure must be executed once for each candidate triple, incurring enormous training and test time costs. For example, GraIL (Teru et al., 2020) takes about one month to infer on the full FB15k-237 test set (Zhu et al., 2021). Moreover, the enclosing subgraph extraction is also constrained by the path length, which severely harms link prediction performance.
Therefore, this paper aims to propose a general framework which can: (1) alleviate the deficiency of HLP; (2) enjoy the merits of validity and generality with tractable computational costs.

Vertical Learning Paradigm
Inspired by the notion that "to copy is easier than to memorize" (Khandelwal et al., 2020), we propose a vertical learning paradigm for KGC task. Different from the implicit memorization strategy of HLP, VLP provides embedding models with the ability to reference related triples as cues for prediction, which can be viewed as an explicit copy strategy.
More concretely, we present the overall pipeline of VLP in Figure 2. Given a query (h, r, ?), the procedure of predicting tail t can be divided into reference query selection, reference graph construction and reference answer aggregation.
Reference Query Selection For the input query q = (h, r, ?), VLP-based models first select N entity-relation pairs (h_i, r) in the training graph as the reference queries {q_i}_{i=1}^N, which can provide relevant semantics for prediction. For example, to answer (Jill Biden, lives_in, ?), we can reference the answer-known query (Joe Biden, lives_in, ?) for target information, since Joe Biden and Jill Biden are highly related. One intuitive way to select references is to choose the top-N entities by cosine similarity between h and all entities involved in relation r during optimization. Nevertheless, this approach incurs high computational costs and is intractable: the time complexity of the similarity calculation is O(n_r d_e), where n_r is the number of r-involved entities and n_r ≈ |E| ≫ d_e in the worst case.
In this work, inspired by the small-world principle (Newman, 2001; Liben-Nowell and Kleinberg, 2007), in which related individuals are connected by short chains (e.g., Joe Biden and Jill Biden are directly connected by the marriage relationship), we introduce a graph distance based approach for efficient reference query selection. Specifically, we select the N r-involved entities {h_i}_{i=1}^N closest to h in terms of relative graph distance (i.e., the shortest path length on the training graph). The corresponding ground-truth targets t_i of the reference queries q_i = (h_i, r, ?) are referred to as reference answers. In this way, VLP-based models can pre-retrieve N related references for every input query, thus incurring no additional computational cost during training and inference.
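The distance-based selection can be sketched with a plain BFS over the training graph. The toy graph, entity names, and the helpers `shortest_distances` / `select_references` below are all hypothetical, not the authors' implementation.

```python
from collections import deque

def shortest_distances(adj, source):
    """BFS shortest-path lengths from `source` over an undirected training graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def select_references(adj, r_heads, h, n):
    """Pick the n r-involved entities closest to h by graph distance."""
    dist = shortest_distances(adj, h)
    candidates = [e for e in r_heads if e != h and e in dist]
    return sorted(candidates, key=lambda e: dist[e])[:n]

# toy graph: edges induced by training triples (undirected, relation-agnostic)
adj = {
    "JillBiden": ["JoeBiden"], "JoeBiden": ["JillBiden", "USA"],
    "USA": ["JoeBiden", "Obama"], "Obama": ["USA"],
}
r_heads = {"JoeBiden", "Obama"}   # heads observed with relation lives_in
refs = select_references(adj, r_heads, "JillBiden", 1)
```

Since the distances depend only on the training graph, this selection can run once as preprocessing, matching the "no additional cost at training and inference" claim above.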
Reference Graph Construction After the efficient reference retrieval, we construct an edge-attributed reference graph to integrate the selected N reference queries and their corresponding answers with the input query. As shown in Figure 2, the input query q is regarded as the central node, and the reference answers t_i are treated as its N neighbors. VLP-based models aim to leverage the explicit reference answers for prediction. However, since there is no guarantee that t_i is the same as the target tail t, it is unreasonable to copy t_i directly without any modification. For example, to answer (England, capital_is, ?), we cannot directly copy the answer of (France, capital_is, ?). Therefore, we introduce the query similarity s_{q,q_i} as the edge attribute between q and t_i. By considering the query differences, VLP-based models are able to adaptively copy the reference answers. For example, to answer the input (England, capital_is, ?), we can adjust the target information from Paris according to the difference between (France, capital_is, ?) and the input query.
Reference Answer Aggregation With the constructed reference graph, VLP-based models learn to explicitly gather target information from the neighbor answers for prediction. Specifically, based on the generalized functions summarized in Section 2.2, the central node embedding q and the neighbor node embeddings k_i can be defined as:

q = W_{r,1}h + b_r,  k_i = W_{r,2}t_i. (8)

The edge embedding s_{q,q_i} (i.e., the query similarity embedding) can be further defined as:

s_{q,q_i} = q − q_i, where q_i = W_{r,1}h_i + b_r. (9)

Then, combining the neighbor nodes and edge attributes, VLP-based models aggregate the reference answers to generate the final embedding t′:

t′ = σ( (1/N) Σ_{i=1}^N W_agg [W_node k_i, W_edge s_{q,q_i}] ), (10)

where σ(·) is a nonlinear activation function (e.g., tanh), [·, ·] is the concatenation operation, and W_agg, W_node and W_edge are shared projection matrices. The output t′ should be close to the target tail embedding t in the latent space, which is measured by the cosine similarity:

f_c(h, r, t) = cos(t′, t). (11)

We highlight that VLP's aggregation strategy in Equation (10) differs from GNN-based methods (Vashishth et al., 2020; Bansal et al., 2019; Shang et al., 2019; Schlichtkrull et al., 2018). For each query (h, r, ?), regardless of whether a reference query is a neighbor of h in the training graph, VLP-based models can directly attend to the reference answers throughout the entire training set.
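As a concrete illustration, the aggregation step can be sketched in numpy. The exact pooling used here (a mean over references) is an assumption, but the ingredients — shared matrices W_node, W_edge, W_agg, concatenation of projected answers with projected edge attributes, and a tanh nonlinearity — follow the description above; all data is toy data.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 8, 4

# N reference answer embeddings k_i and edge attributes s_{q,q_i} (toy data)
K = rng.normal(size=(N, d))
S = rng.normal(size=(N, d))
# shared projection matrices; W_agg maps the concatenated message back to d dims
W_node = rng.normal(size=(d, d))
W_edge = rng.normal(size=(d, d))
W_agg = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)

# per-reference messages: projected answer concatenated with projected edge attribute
msgs = np.concatenate([K @ W_node.T, S @ W_edge.T], axis=1)   # shape (N, 2d)
t_prime = np.tanh((msgs @ W_agg.T).mean(axis=0))              # pooled prediction t'

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

t_candidate = rng.normal(size=d)
score = cosine(t_prime, t_candidate)   # vertical score f_c for this candidate
```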
Score Function For each test triple (h, r, t), to alleviate the deficiency of HLP and predict more accurately, we integrate the vertical score f_c with the horizontal score f_g to form the final score function f with a weight hyper-parameter λ:

f(h, r, t) = f_c(h, r, t) + λ f_g(h, r, t). (12)
Note that VLP can be widely applied to various embedding models, since the reference aggregation is designed on the generalized score function.
Complexity Analysis Compared with the vanilla embedding models, VLP-based models introduce only a few additional parameters, i.e., the shared aggregation matrices in Equation (10). Therefore, VLP-based models have the same space complexity as HLP-based models, i.e., O(|E| d_e). In terms of the time cost of processing a single triple, the time complexity of vanilla embedding models is O(d_r d_e), derived from the generalized score function in Equation (4). VLP-based models require the same computation for each reference, which yields a time complexity of O(N d_r d_e). This computation is tractable, since a small N (no more than 8) is enough for VLP-based models to achieve high performance in our experiments.

Optimization
During training, we jointly optimize f_c and f_g with a two-component loss function weighted by a coefficient α:

L = L_c + α L_g. (13)
For the former, we use the cross-entropy between predictions and labels as the training loss:

L_c = − Σ_{i=1}^{|E|} y_i log p_i, (14)

where p_i and y_i are the i-th components of p and y, respectively; p ∈ R^{|E|} is calculated by applying the softmax function to the "1-to-All" (Lacroix et al., 2018a) results of f_c; y ∈ R^{|E|} is the one-hot vector that indicates the position of the true label. For the latter, negative sampling has proved quite effective in extensive works (Song et al., 2021; Sun et al., 2019). Formally, for a positive triple (h, r, t), we first sample a set of entities {t′_i}_{i=1}^l (or {h′_i}_{i=1}^l) based on the pre-sampling weights p_0 to construct negative triples (h, r, t′_i) (or (h′_i, r, t)). With these samples, a negative sampling loss is designed to optimize embedding models:

L_g = − log σ(γ + f_g(h, r, t)) − Σ_{i=1}^l p_1(h′_i, r, t′_i) log σ(−f_g(h′_i, r, t′_i) − γ), (15)

where γ is a pre-defined margin, σ is the sigmoid function, l denotes the number of negative samples, and (h′_i, r, t′_i) is a negative sample against (h, r, t). Importantly, p_1(h′_i, r, t′_i) is the post-sampling weight, which determines the proportion of (h′_i, r, t′_i) in the current optimization. As shown in Figure 3, recent works (Song et al., 2021; Chao et al., 2021; Gao et al., 2020; Sun et al., 2019) utilize the self-adversarial technique (Self-Adv), in which the pre-sampling weights follow a uniform distribution and the post-sampling weights increase with the negative scores. Differently, in this work we propose a new approach named ReD based on relative distance, which draws more informative negative samples and reduces the toxicity of false negative samples.
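The "1-to-All" cross-entropy component described above can be sketched as follows. The helper `one_to_all_ce` and the toy scores are illustrative assumptions; the point is that the loss is a standard softmax cross-entropy over all |E| candidate entities for one query (h, r, ?).

```python
import numpy as np

def one_to_all_ce(scores, target_idx):
    """Cross-entropy over all |E| candidate entities for one query (h, r, ?)."""
    z = scores - scores.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax over every entity
    return float(-np.log(p[target_idx]))

scores = np.array([2.0, 0.5, -1.0, 0.0])   # f_c scores against every entity (toy)
loss = one_to_all_ce(scores, target_idx=0)
```

The loss is smaller when the true entity already receives the highest score, and grows as probability mass shifts to other candidates.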
For the pre-sampling weights: considering the deficiency of embedding models described in Section 3, distant entities are rarely predicted as the target answer. This reveals a rational prior, i.e., distant entities are more likely to form easy (meaningless) negative triples. This inspires us to sample more hard (informative) negative triples based on the relative graph distance d_g. As shown in Figure 3, the pre-sampling weight in ReD decreases as the graph distance between head and tail entities increases. Formally, for a training query (h, r, ?), we pre-sample entities t′ to construct negatives from the following distribution:

p_0(t′_j | h, r) = exp(−α_0 d_g(h, t′_j)) / Σ_j exp(−α_0 d_g(h, t′_j)), (16)

where α_0 is the pre-sampling temperature and d_g(·, ·) outputs the relative graph distance between two entities. Note that the calculation of d_g(·, ·) is a one-time preprocessing step, which brings no additional training overhead. For the post-sampling weights: Self-Adv assigns greater weights to high-scoring negative triples in Equation (15), which makes the optimization focus more on hard negatives. However, this monotonically increasing strategy ignores the issue of false negatives, since triples with higher scores are more likely to be correct. A more rational posterior is that under-scored negatives are easy and over-scored negatives are likely false. In this work, we use the relative latent distance between the positive and negative samples to determine whether a negative score is too low or too high. Specifically, ReD defines the post-sampling weights as a distribution that first rises and then falls as the negative score increases. As shown in Figure 3, if the negative score is significantly greater than (or less than) the positive score, the negative sample is more likely to be false (or easy), and is thus assigned a small weight in Equation (15).
Formally, based on the positive score c = f_g(h, r, t) and the negative scores n_i = f_g(h′_i, r, t′_i), the post-sampling weight in ReD is defined as:

p_1(h′_i, r, t′_i) = σ(α_1(n_i − c)) σ(α_2(c − n_i)) / Σ_{j=1}^l σ(α_1(n_j − c)) σ(α_2(c − n_j)), (17)

where α_1 and α_2 are the post-sampling temperatures. By combining the sampling weights in Equations (16) and (17), ReD generates and processes higher-quality negatives for optimization.
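Both weighting schemes can be sketched in a few lines. The softmax form of the pre-sampling weights and the sigmoid-product form of the post-sampling weights below are illustrative assumptions — one simple family with the required "decays with distance" and "rises then falls around the positive score" shapes, not necessarily the exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pre_sampling_weights(graph_dists, alpha0):
    """ReD pre-sampling: weight decays with graph distance (softmax form assumed)."""
    logits = -alpha0 * np.asarray(graph_dists, dtype=float)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def post_sampling_weights(neg_scores, pos_score, a1, a2):
    """Rise-then-fall weighting around the positive score: easy negatives
    (score << positive) and likely-false negatives (score >> positive)
    are both down-weighted. Illustrative parameterization."""
    x = np.asarray(neg_scores, dtype=float) - pos_score
    w = sigmoid(a1 * x) * sigmoid(-a2 * x)
    return w / w.sum()

p0 = pre_sampling_weights([1, 2, 3, 5], alpha0=1.0)   # closer entities drawn more often
p1 = post_sampling_weights([-4.0, -0.5, 0.0, 4.0], pos_score=0.0, a1=1.0, a2=1.0)
```

Under this choice, the negative whose score matches the positive gets the largest weight, while negatives far below or far above it are suppressed, as Figure 3 describes.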

Experimental Setup
Datasets We evaluate our proposal on two widely-used benchmarks: WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015). More details can be found in Appendix B.
Baselines To verify the effectiveness and generality of our proposal, we combine the proposed techniques with three representative embedding models: DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019). For performance comparison, we select a series of embedding models as baselines in Table 2.

Implementation Details
We fine-tune the hyperparameters with the grid search on the validation sets. Please see Appendix C for more details.

Main Results
The experimental results are reported in Table 2. Compared with their vanilla counterparts, the VLP-based models achieve notable absolute improvements in MRR (e.g., 2.4%). Such obvious gains reveal that the vertical contexts generally inject valuable information into the embedding models for more accurate prediction. Moreover, one can further see that ComplEx-VLP and RotatE-VLP perform competitively with the SOTA baselines. Specifically, RotatE-VLP surpasses all the baselines in terms of most metrics on both datasets; ComplEx-VLP also achieves promising performance on FB15k-237 compared with the baselines. This superior performance further confirms the effectiveness of our proposal.

Fine-grained Performance Analysis
Performance on Distance Splits Table 3 reports the performance of the three VLP-based models on the distance splits defined in Appendix A. One can observe that: (1) the VLP-based embedding models outperform the vanilla models across all the distance splits; (2) the VLP-based models achieve greater relative improvements on splits with larger d_ht. For example, as d_ht increases from 1 to 4, RotatE-VLP achieves 0.5%, 4.3%, 20.6% and 22.0% relative improvements over RotatE on the MRR metric, respectively. This reveals that the explicit vertical contexts can significantly alleviate the limitations of the memorization strategy in embedding models.

Performance on Each Relation
To verify the modeling capacity of our proposal from a fine-grained perspective, we explore the performance of VLP-based models on each relation of WN18RR, following (Zhang et al., 2019). As shown in Table 4, RotatE-VLP surpasses RotatE and QuatE on all 11 relation types, confirming that the explicit reference aggregation brings superior modeling capacity.

Performance on Relation Mapping Properties Table 5 exhibits the performance of our proposal on different relation mapping properties (RMPs) (Sun et al., 2019) in FB15k-237. We observe that RotatE-VLP consistently outperforms RotatE across all RMP types. This advanced performance owes to the powerful modeling capability of the explicit copy strategy.

Impact of Reference Quantity
VLP aggregates target information from N references pre-selected before training. We investigate the impact of N on the performance (MRR) of VLP-based models. Figure 4 shows the results on the WN18RR dataset. As expected, all three VLP-based models achieve better performance with more vertical references, since aggregating sufficient references brings superior modeling capacity. Moreover, we observe that the models achieve high performance with N less than 10, keeping the computation tractable as discussed in Section 4.1.

Ablation Study of ReD
To explore the effectiveness of the proposed ReD, we conduct ablation studies on the pre-sampling and post-sampling parts of the three VLP-based models. Table 6 shows the detailed results. We can observe that the removal of any part reduces the performance, which demonstrates that ReD makes the model focus more on meaningful negative samples for more effective optimization. Moreover, we also integrate ReD with original embedding models to verify the generality of this technique. Please refer to Appendix D for more results.

Related Work
Embedding models can be roughly categorized into distance-based models and semantic matching models (Chao et al., 2021). Distance-based models use the Euclidean distance to measure the plausibility of each triple. A series of works follows this line, such as TransE (Bordes et al., 2013), TransH (Wang et al., 2014), TransR (Lin et al., 2015), RotatE (Sun et al., 2019), PairRE (Chao et al., 2021), Rot-Pro (Song et al., 2021), ReflectE and so on. TransE and RotatE are the most representative distance-based models, representing relations as translations and rotations, respectively. Semantic matching models utilize multiplicative functions to score each triple, including RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), QuatE (Zhang et al., 2019), DualE (Cao et al., 2021) and so on. Typically, RESCAL (Nickel et al., 2011) represents each relation as a full matrix in a bilinear factorization. DistMult (Yang et al., 2015) restricts the relation matrices to be diagonal to prevent overfitting. However, existing embedding models essentially follow the horizontal learning paradigm, underperforming in predicting links between distant entities.
Moreover, some advanced techniques are proposed to improve embedding models, such as graph encoders (Schlichtkrull et al., 2018;Shang et al., 2019;Vashishth et al., 2020; and regularizers (Lacroix et al., 2018b). Note that our proposals are orthogonal to these techniques, and one can integrate them for better performance.

Conclusion
In this paper, we present a novel learning paradigm named VLP for KGC task. VLP can be viewed as an explicit copy strategy, which allows embedding models to consult related triples for explicit references, making it much easier to predict distant links. Moreover, we also propose ReD, a new negative sampling technique for more effective optimization. The in-depth experiments on two datasets demonstrate the validity and generality of our proposals.

Limitations
Although our proposal enjoys the advantages of validity and generality, there are still two major limitations. First, VLP cannot directly generalize to the inductive setting, since VLP is defined based on the score functions of transductive embedding models. One potential direction is to design an inductive reference selector for emerging entities. Second, how to efficiently select more helpful references for prediction is still an open challenge. We expect future studies to mitigate these issues.

A Experimental Observation
The motive of our work originates from an experimental observation that embedding models underperform in predicting links between distant entity pairs. Specifically, according to the relative graph distance d_ht between the head and tail entities of each test triple, we divide the test sets of WN18RR and FB15k-237 into four splits. Three representative embedding models (DistMult, ComplEx and RotatE) are tested on each split. Figure 5 summarizes the detailed MRR results and split ratios on the two datasets. We observe that all three embedding models achieve promising results in link prediction between close entities, while the performance drops significantly between distant entities. For example, on the split of WN18RR where d_ht = 1, RotatE achieves excellent performance (MRR of 0.986), while on the split where d_ht = 2, its performance decreases by about 62% (MRR of 0.375).

B Datasets
The WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova and Chen, 2015) datasets are subsets of WN18 (Bordes et al., 2013) and FB15k (Bordes et al., 2013), respectively, with inverse relations removed. WN18 is extracted from WordNet (Miller, 1995), a database featuring lexical relations between words. FB15k is extracted from Freebase (Bollacker et al., 2008), a large-scale KG containing general knowledge facts.

C Implementation Details
We use Adam (Kingma and Ba, 2015) as the optimizer and fine-tune the hyperparameters on the validation dataset. The hyperparameters are tuned by the grid search, including batch size b, embedding dimension d, negative sampling temperatures {α i } 2 i=0 , loss weight λ and fixed margin γ. The hyper-parameter search space is shown in Table 8.

D Embedding Models with ReD
To verify the generality of the proposed negative sampling technique ReD, we integrate ReD with three representative embedding models (i.e., DistMult, ComplEx and RotatE) for the KGC task. As shown in Table 9, compared to Self-Adv, the embedding models combined with ReD achieve better performance on both datasets, since ReD yields more informative negative samples in both the pre-sampling and post-sampling stages.