SMiLE: Schema-augmented Multi-level Contrastive Learning for Knowledge Graph Link Prediction

Link prediction is the task of inferring missing links between entities in knowledge graphs. Embedding-based methods have shown effectiveness in addressing this problem by modeling relational patterns in triples. However, the link prediction task often requires contextual information from entity neighborhoods, which most existing embedding-based methods fail to capture. Additionally, little attention is paid to the diversity of entity representations in different contexts, which often leads to false prediction results. In this situation, we argue that the schema of a knowledge graph provides specific contextual information and helps preserve the consistency of entity representations across contexts. In this paper, we propose a novel Schema-augmented Multi-level contrastive LEarning framework (SMiLE) to conduct knowledge graph link prediction. Specifically, we first exploit the network schema as a prior constraint to sample negatives and pre-train our model with a multi-level contrastive learning method to capture both prior schema information and contextual information. Then we fine-tune our model under the supervision of individual triples to learn subtler representations for link prediction. Extensive experimental results on four knowledge graph datasets, with a thorough analysis of each component, demonstrate the effectiveness of our proposed framework against state-of-the-art baselines. The implementation of SMiLE is available at https://github.com/GKNL/SMiLE.


Introduction
A knowledge graph (KG), as a well-structured representation of knowledge, stores a vast amount of human knowledge in the form of (head, relation, tail) triples. KGs are essential components for various artificial intelligence applications, including question answering (Diefenbach et al., 2018), recommendation systems (Wang et al., 2021b), etc. In the real world, KGs often suffer from the incompleteness problem, meaning that a large number of valid links are missing. In this situation, link prediction techniques, which aim to automatically predict whether a relationship exists between a head entity and a tail entity, are essential for triple construction and verification.
To address the link prediction problem in KGs, a variety of methods have been proposed. Traditional rule-based methods like Markov logic networks (Richardson and Domingos, 2006) and reinforcement learning-based methods (Xiong et al., 2017) learn logic rules from KGs to conduct link prediction. The other mainstream methods are based on knowledge graph embeddings, including translational models like TransE (Bordes et al., 2013) and TransR (Lin et al., 2015), and semantic matching models like RESCAL (Nickel et al., 2011) and DistMult (Yang et al., 2015). Besides, some embedding-based methods leverage graph neural networks to explore graph topology (Vashishth et al., 2020) or utilize type information (Ma et al., 2017) to enhance representations in KGs.
Nevertheless, the aforementioned methods fail to model the contextual information in entity neighborhoods. In fact, the context of an entity preserves specific structural and semantic information, and the link prediction task is essentially dependent on the contexts related to specific entities and triples. Furthermore, little attention is paid to the diversity of entity representations in different contexts, which may often result in false predictions. Quantitatively, the FB15k dataset has 14579 entities and 154916 triples, of which 14417 entities (98.89%) have types. There are 13853 entities (95.02%) that have more than two types, and each entity has 10.02 types on average. For example, the entity Nicole Kidman in Figure 1 has two different types (Actress and Citizen), expressing different semantics in two different contexts. Specifically, the upper left of the figure describes the type-level contextual information about "Awards and works of Nicole Kidman as an actress". In this case, it is well-founded that there exists a relation between Nicole Kidman and 66th Cannes, and intuitively the prediction of (Nicole Kidman, ?, Lane Cove Public School) does not make sense, since there is no direct relationship between type Actress and type School. But considering that Nicole Kidman is also an Australian citizen, it is reasonable to conduct such a prediction.
We argue that the key challenge of preserving contextual information in embeddings is how to encapsulate the complex contexts of entity neighborhoods. Simply treating all information in an entity's subgraph as its context may bring in redundant and noisy information. The schema, as a high-order meta pattern of a KG, contains the type constraints between entities and relations, and it can naturally be used to capture the structural and semantic information in a context. As for the problem of inconsistent entity representations, the diverse representations of an entity in different contexts are indispensable. Since different schemas define diverse type restrictions between entities, they can preserve subtle and precise semantic information in a specific context. Additionally, to yield consistent and robust entity representations for each contextual semantics, entities in contexts of the same schema are supposed to share similar features, while entities in different contexts should be disparate.
To tackle the aforementioned issues, inspired by advanced contrastive learning techniques, we propose a novel schema-augmented multi-level contrastive learning framework to allow efficient link prediction in KGs. To address the incompleteness of the KG schema, we first extract and build a <head_type, relation, tail_type> tensor from an input KG (Rosso et al., 2021) to represent the high-order schema information. Then, we design a multi-level contrastive learning method under the guidance of the schema. Specifically, we optimize the contrastive learning objective at the contextual level and the global level of our model separately. At the contextual level, contrasting entities within subgraphs of the same schema learns semantic and structural characteristics in a specific context. At the global level, differences and global connections between contexts of an entity are captured via a cross-view contrast. Overall, we exploit the aforementioned contrastive strategy to obtain entity representations with structural and high-order semantic information in the pre-training phase, and then fine-tune the representations of entities and relations to learn subtler knowledge of the KG.
To summarize, we make three major contributions in this work:
• We propose a novel multi-level contrastive learning framework to preserve contextual information in entity embeddings. Furthermore, we learn different entity representations for different contexts.
• We design a novel approach to sample hard negatives by utilizing the KG schema as a prior constraint, and perform contrast estimation at both the contextual level and the global level, pulling the embeddings of entities in the same context closer while pushing apart entities in dissimilar contexts.
• We conduct extensive experiments on four different kinds of knowledge graph datasets and demonstrate that our model outperforms state-of-the-art baselines on the link prediction task.
2 Related Work

KG Inference
To conduct inference such as link prediction on incomplete KGs, most traditional methods enumerate relational paths as candidate logic rules, including Markov logic networks (Richardson and Domingos, 2006), rule mining algorithms (Meilicke et al., 2019) and path ranking algorithms (Lao et al., 2011).
However, these rule-based methods suffer from limited generalization performance due to the enormous search space.
The other mainstream methods are based on reinforcement learning, which defines the problem as a sequential decision-making process (Xiong et al., 2017; Lin et al., 2018). They train a path-finding agent and then extract logic rules from reasoning paths. However, the reward signal in these methods can be exceedingly sparse.

KG Embedding Models
Various methods have been explored to perform KG inference based on KG embeddings. Translation-based models including TransE (Bordes et al., 2013), TransR (Lin et al., 2015) and RotatE (Sun et al., 2019) model a relation as a translation operation from the head entity to the tail entity. Semantic matching methods like DistMult (Yang et al., 2015) and QuatE (Zhang et al., 2019) measure the authenticity of triples through a similarity score function. GNN-based methods are proposed to comprehensively exploit the structural information of neighbors via a message-passing mechanism. R-GCN (Schlichtkrull et al., 2018) and CompGCN (Vashishth et al., 2020) employ GCNs to model multi-relational KGs.
More recently, some methods integrate auxiliary information into KG embeddings. JOIE (Hao et al., 2019) considers ontological concepts as supplemental knowledge in representation learning. TransT (Ma et al., 2017) and TKRL (Xie et al., 2016) leverage the rich information in entity types to enhance representations. Nevertheless, these graph-based methods capture relational and structural information but fail to capture the contextual semantics and schema information in KGs.

Graph Contrastive Learning
Contrastive learning is an effective technique for learning representations by contrasting similarities between positive and negative samples (Le-Khac et al., 2020). More recently, self-supervised contrastive learning methods have been introduced into the graph representation learning area. HeCo (Wang et al., 2021c) proposes a co-contrastive learning strategy for learning node representations from the meta-path view and the schema view. CPT-KG (Jiang et al., 2021b) and PTHGNN (Jiang et al., 2021a) optimize contrastive estimation at the node feature level to pre-train GNNs on heterogeneous graphs. Furthermore, Ouyang et al. (2021) propose a hierarchical contrastive model to deal with representation learning on imperfect KGs. SimKGC (Wang et al., 2022) explores a more effective contrastive learning method for text-based knowledge representation learning with pre-trained language models.

The Proposed SMiLE Framework
In this section, we first present the notations related to this work. Then we introduce the details and training strategy of our proposed framework. The overall architecture of SMiLE is shown in Figure 2.

Notations
A knowledge graph can be defined as G = (E, R, T, P), where E and R indicate the set of entities and the set of relations, respectively. T represents the collection of triples (s, r, o) and P is the set of all entity types. Each entity s (or o) ∈ E has one or more types t_s1, t_s2, ..., t_sn ∈ P.
The goal of our SMiLE model is to learn structure- and context-preserving entity representations to perform effective link prediction in knowledge graphs, i.e., to infer missing links in an incomplete G. Ideally, the probability scores of positive triples are supposed to be higher than those of corrupted negative ones.
Context Subgraph. Given an entity s, we regard its k-hop neighbors with related edges as its context subgraph, denoted as g_c(s). Likewise, we define the context subgraph between two entities s and o as the k-hop neighbors connecting s and o via several relations, represented as g_c(s, o).
Knowledge Graph Schema. The schema of a KG can be defined as S = (P, R), where P is the set of all entity types and R is the set of all relations. Consequently, the schema of a KG can be characterized as a set of entity-typed triples (t_s, r, t_o), meaning that an entity s of type t_s is connected to an entity o of type t_o via a relation r.
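To make these notations concrete, here is a minimal Python sketch (not from the paper's implementation; the toy triples, type assignments, and the context_subgraph helper are illustrative) that stores a KG as a triple list and extracts the k-hop context subgraph g_c(s) of an anchor entity:

```python
from collections import defaultdict, deque

# Toy KG: triples (s, r, o); entity_types maps each entity to a set of type labels.
triples = [
    ("NicoleKidman", "wonAward", "66thCannes"),
    ("NicoleKidman", "attended", "LaneCovePublicSchool"),
    ("66thCannes", "heldIn", "France"),
]
entity_types = {"NicoleKidman": {"Actress", "Citizen"},
                "66thCannes": {"AwardCeremony"},
                "LaneCovePublicSchool": {"School"},
                "France": {"Country"}}

adj = defaultdict(list)                      # undirected adjacency for neighborhood expansion
for s, r, o in triples:
    adj[s].append((r, o))
    adj[o].append((r, s))

def context_subgraph(anchor, k=2):
    """Return the triples whose entities lie within the k-hop neighborhood of `anchor`, i.e. g_c(s)."""
    visited, frontier = {anchor}, deque([(anchor, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for _, nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return [(s, r, o) for s, r, o in triples if s in visited and o in visited]

print(context_subgraph("NicoleKidman", k=1))
```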

Network Schema Construction
Since some existing KGs do not contain a complete schema, inspired by RETA (Rosso et al., 2021), we design a simple but effective approach to construct the schema S from a KG G.
First, for all triples (s, r, o) in the KG, we convert each entity to its corresponding type, hence all entity-typed triples form a typed collection S = {(t_s, r, t_o) | (t_s, r, t_o) ∈ P × R × P}. Noticing that each entity in a KG may have multiple types, we take each combination of entity types in an entity-typed triple into account.
Context Schema. Given an entity s and its context subgraph g_c(s), we get an entity-typed subgraph S_t(s) by converting all entities in g_c(s) to their corresponding types. Then we apply the intersection operation between S_t(s) and the KG schema S, hence we obtain the context schema of g_c(s) as S_c(s) = S_t(s) ∩ S.
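As a rough illustration of this construction, the following sketch builds the typed collection S and the context schema S_c(s) under the simplifying assumption that every type combination of a triple's entities is kept (any filtering of candidate entity-typed triples, as suggested by the coverage ratio in Table 7, is omitted here):

```python
from itertools import product

def build_schema(triples, entity_types):
    """Lift each (s, r, o) triple to every combination of its entities' types,
    yielding the entity-typed schema set S = {(t_s, r, t_o)}."""
    schema = set()
    for s, r, o in triples:
        for t_s, t_o in product(entity_types[s], entity_types[o]):
            schema.add((t_s, r, t_o))
    return schema

def context_schema(subgraph_triples, entity_types, schema):
    """Type every triple of a context subgraph and keep only the typed triples
    that also occur in the KG schema S (the intersection step S_c(s) = S_t(s) & S)."""
    typed = build_schema(subgraph_triples, entity_types)
    return typed & schema

# Illustrative usage on a one-triple toy KG.
entity_types = {"NicoleKidman": {"Actress", "Citizen"}, "66thCannes": {"AwardCeremony"}}
triples = [("NicoleKidman", "wonAward", "66thCannes")]
print(build_schema(triples, entity_types))
```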

Multi-view Entity Encoder
Generally, entities preserve multiple expressions under different views, hence we encode entities into different representations to preserve diverse features in the context view and the structure view, respectively.
Structure-view Encoder. Given an entity s and a relation r, we first obtain their global structure-aware representations as h_s = f_e(s; G) and z_r = f_r(r). To obtain graph-structure-based embeddings of the KG, we adopt a GNN model as the implementation of f_e(·; G), and we use an i.i.d. embedding network to implement f_r(·).
Context-view Encoder. To capture the inherent knowledge in a context schema, we employ the k-layer stacked contextual translation function (Wang et al., 2021a) to learn contextual embeddings of the entities E_c in a context subgraph. Each layer updates the entity embeddings using an MLP encoder Enc(·), a layer-specific parameter matrix W_sc ∈ R^{d_k × d_{k+1}}, and a semantic association matrix Ā_i ∈ R^{|E_c| × |E_c|} computed by a multi-head attention mechanism. We then obtain the context-view embedding of node s by aggregating the outputs of all layers.
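A minimal PyTorch sketch of such a context-view encoder is shown below; it follows the spirit of the stacked contextual translation layers but is not the paper's exact formulation (the placement of Enc(·), the use of nn.MultiheadAttention to play the role of the association matrix Ā, the fixed dimension, and the mean aggregation over layers are all assumptions):

```python
import torch
import torch.nn as nn

class ContextViewEncoder(nn.Module):
    """Sketch of a k-layer contextual translation encoder in the spirit of
    SLiCE (Wang et al., 2021a); hyper-parameters are illustrative."""
    def __init__(self, dim=128, num_layers=4, num_heads=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # Enc(.)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers))
        self.w_sc = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(num_layers))

    def forward(self, ctx_embeddings):           # ctx_embeddings: [1, |E_c|, dim]
        h = self.enc(ctx_embeddings)
        outputs = []
        for attn, w in zip(self.attn, self.w_sc):
            # Multi-head attention stands in for the semantic association matrix A_bar.
            h, _ = attn(h, h, h)
            h = w(h)                             # layer-specific W_sc
            outputs.append(h)
        return torch.stack(outputs).mean(dim=0)  # aggregate the k layer outputs

enc = ContextViewEncoder()
ctx = torch.randn(1, 6, 128)                     # six entities in a context subgraph
print(enc(ctx).shape)                            # torch.Size([1, 6, 128])
```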

Contextual-level Contrastive Learning
We exploit contextual-level contrastive learning to capture the latent semantics and correlations of entities within context schemas. A context schema constrains the types of tail entities and relations that a head entity can be related to, which helps obtain harder negative samples and thus contributes to more effective contrast estimation.
Positive Samples. Given a context subgraph g_c(s) and its corresponding context schema S_c(s), let s be the anchor entity of g_c(s) and S_c(s), while the other entities in g_c(s) are the context entities. We define the positive samples of the anchor entity s, at both the contextual level and the global level, as the context entities connected to s by a triple in T_c(s), where T_c(s) is the set of triples in g_c(s).
Intra-schema Negative Samples. For two anchor entities u and v of the same type, if the context subgraphs g_c(u) and g_c(v) generated from them can be projected to the same context schema, we define their respective context entities as negative samples of each other. Formally, given a batch of anchor entities E_B, we denote this set of in-batch negative samples of entity s as N_s^cur.
Generally, the number of intra-schema negative samples in a batch is coupled with the batch size. We therefore employ a dynamic queue to store entity embeddings from previous batches (He et al., 2020; Wang et al., 2022), aiming to increase the number of intra-schema negative samples. We denote the queued pre-batch negative samples of entity s as N_s^pre, where E_B^{-n} represents the entities of the n-th previous batch. Since embeddings from previous batches are computed with previous model parameters, we keep n small to ensure that they remain consistent with the negative samples in N_s^cur. The total intra-schema negative samples of the anchor entity s at the contextual level are N_s = N_s^cur ∪ N_s^pre.
With the context-view embeddings of entities, we apply an InfoNCE loss (Ouyang et al., 2021) to perform contrast estimation, where τ is a temperature hyper-parameter that controls the sensitivity of the score function, and we apply cosine similarity as the score function ϕ. Different from previous contrast-based methods, we take multiple positive samples into consideration when computing the contrastive loss.
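The following PyTorch sketch shows one plausible form of this multi-positive InfoNCE objective (an illustrative implementation, not the paper's exact equation; averaging over positives and the τ = 0.8 default follow the descriptions above, and the queued negatives are assumed to be concatenated into the negative set beforehand):

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(anchor, positives, negatives, tau=0.8):
    """Contextual-level loss sketch: one anchor embedding, several positive embeddings
    (entities sharing the anchor's context schema) and intra-schema negatives,
    with cosine similarity as the score function phi."""
    pos = F.cosine_similarity(anchor.unsqueeze(0), positives) / tau     # shape [P]
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau     # shape [N]
    neg_term = torch.logsumexp(neg, dim=0)
    # -log( exp(pos_i) / (exp(pos_i) + sum_j exp(neg_j)) ), averaged over the positives
    return -(pos - torch.logaddexp(pos, neg_term)).mean()

anchor = torch.randn(128)
loss = multi_positive_info_nce(anchor, torch.randn(3, 128), torch.randn(512, 128))
print(loss.item())
```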

Global-level Contrastive Learning
In addition to local contexts, it is essential to capture correlations among various context subgraphs. We apply a cross-view contrastive learning strategy to strike a balance between global schema features and contextual features of KG representations.
Inter-schema Negative Samples. If u and v are two anchor entities corresponding to two different context schemas, we define their context entities as negative samples of each other, where E_B indicates a batch of anchor entities.
Global-level Optimization. Given the embeddings of entity s under the context view and the structure view, we feed them into an MLP encoder with one hidden layer, mapping them into the space where the contrastive loss is calculated: c'_s = W_2 σ(W_1 c_s + b_1) + b_2 and h'_s = W_2 σ(W_1 h_s + b_1) + b_2, where σ is the ELU activation function. It is worth noting that the weight matrices {W_1, W_2} and bias parameters {b_1, b_2} are shared between the embeddings of the two views. Then we perform cross-view contrastive learning between the context-view and structure-view representations of entities, where τ is the temperature hyper-parameter and ϕ is the cosine similarity score function.
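A hedged PyTorch sketch of the shared projection head and the cross-view contrast is given below; the exact loss form is an assumption consistent with the description (cosine similarity, temperature τ, inter-schema negatives), not the paper's equation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """One-hidden-layer MLP with ELU; the same {W1, W2, b1, b2} project both the
    context-view and structure-view embeddings into the contrast space."""
    def __init__(self, dim=128, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(F.elu(self.fc1(x)))

def cross_view_loss(ctx_emb, str_emb, neg_str_emb, proj, tau=0.8):
    """Global-level sketch: pull an entity's projected context-view and structure-view
    embeddings together, push it away from inter-schema negatives (illustrative form)."""
    z_c, z_s = proj(ctx_emb), proj(str_emb)
    z_neg = proj(neg_str_emb)
    pos = F.cosine_similarity(z_c, z_s, dim=-1) / tau                   # scalar
    neg = F.cosine_similarity(z_c.unsqueeze(0), z_neg, dim=-1) / tau    # shape [N]
    return -(pos - torch.logaddexp(pos, torch.logsumexp(neg, dim=0)))

proj = SharedProjection()
loss = cross_view_loss(torch.randn(128), torch.randn(128), torch.randn(64, 128), proj)
print(loss.item())
```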

Training Objective for link prediction
To capture both the semantic and structural information in context schemas and individual triples for link prediction, we employ a pre-train & fine-tune pipeline to optimize our proposed SMiLE model.

Contrastive optimization in pre-training
In the pre-training phase, we employ the multi-level contrastive learning strategy described in Sections 3.4 and 3.5 to optimize the model parameters θ. To capture the semantic and structural knowledge of entities at both the contextual and the global level, we jointly minimize the contextual-level and global-level losses, combined with a balancing coefficient λ that controls the relative weight of the two levels.
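A minimal sketch of one plausible way to combine the two losses is shown below; the convex combination with λ is an assumption, since the paper only states that λ balances the two levels:

```python
def pretrain_loss(loss_ctx, loss_glo, lam=0.5):
    """Joint pre-training objective: lam balances the contextual-level and
    global-level contrastive losses (0.5 is an illustrative default)."""
    return lam * loss_ctx + (1.0 - lam) * loss_glo
```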

Fine-tuning for link prediction
With the pre-trained model parameters θ as an initialization, we further fine-tune the model to learn subtler representations of individual entities and relations under the supervision of each individual triple. For a positive triple in the KG, we construct its negative samples by corrupting the head or tail entity, with the restriction that the replaced entity should have the same type as the original one. Then, for each triple (s, r, o), we obtain the relation-aware embedding of the head entity s as h_s^r = Φ(h_s, z_r), where Φ(·) denotes a non-parameterized entity-relation composition operation (Vashishth et al., 2020), which can be subtraction, multiplication, circular-correlation, etc.
Next, for a triple (s, r, o), we generate its corresponding context subgraph g_c(s, o) by employing a shortest-path strategy, which takes the shortest path between entity s and entity o as the context. Feeding the entities in g_c(s, o) with their relation-aware embeddings into the context-view encoder, we obtain the context-view embeddings of entities s and o in the triple (s, r, o), denoted as c_s^r and c_o. The training objective in the fine-tuning phase maximizes the scores of positive triples while minimizing the scores of negative ones, where T_p and T_n represent the sets of positive and negative triples, respectively, and ϕ_r(c_s, c_o) denotes the score function measuring the compatibility between an entity pair via relation r. Here we adopt the dot product similarity, ϕ_r(c_s, c_o) = σ(c_s^r · c_o), where σ is the sigmoid function.
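The sketch below illustrates the fine-tuning computation under stated assumptions: the composition operation Φ follows the variants listed above, and the binary cross-entropy form of the loss over T_p and T_n is an assumed instantiation of the described objective, not necessarily the paper's exact equation:

```python
import torch

def relation_aware(h_s, z_r, op="mult"):
    """Non-parameterized composition Phi(h_s, z_r); subtraction, multiplication and
    circular-correlation are the variants mentioned in the text."""
    if op == "sub":
        return h_s - z_r
    if op == "mult":
        return h_s * z_r
    if op == "corr":  # circular correlation via FFT
        return torch.fft.irfft(torch.conj(torch.fft.rfft(h_s)) * torch.fft.rfft(z_r), n=h_s.shape[-1])
    raise ValueError(op)

def finetune_loss(c_s_r, c_o, neg_c_s_r, neg_c_o):
    """Assumed binary cross-entropy objective: sigmoid of the dot product scores
    positive triples toward 1 and type-constrained corrupted triples toward 0."""
    pos_score = torch.sigmoid((c_s_r * c_o).sum(-1))
    neg_score = torch.sigmoid((neg_c_s_r * neg_c_o).sum(-1))
    return -(torch.log(pos_score + 1e-12).mean() + torch.log(1 - neg_score + 1e-12).mean())

h_s, z_r, c_o = torch.randn(128), torch.randn(128), torch.randn(128)
loss = finetune_loss(relation_aware(h_s, z_r), c_o, torch.randn(5, 128), torch.randn(5, 128))
print(loss.item())
```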

Complexity Analysis
Theoretically, the major difference between SMiLE and previous baseline models lies in the negative sampling and the contrastive loss, whose cost is related to the number of negative samples. The complexity of the pre-training phase is O(|E| · |P| · (k_1 + k_2)), where |E| is the number of entity nodes, |P| denotes the size of the positive sample set at both the contextual and the global level, and k_1 and k_2 are the numbers of negative samples per positive sample at the contextual and the global level, respectively. The complexity of the fine-tuning phase is O(|R| · N_c), where |R| is the number of relation edges and N_c denotes the maximum number of nodes in any context subgraph.

Experiments

Baselines. We compare SMiLE against two categories of baselines. The first category is KG embedding models, including TransE (Bordes et al., 2013), ComplEx-N3 (Lacroix et al., 2018), TransR (Lin et al., 2015), TypeComplex (Jain et al., 2018) with additional type information, SANS (Ahrabian et al., 2020) with structure-aware negative samples, and the SOTA model PairRE (Chao et al., 2021) with paired relation vectors. The second category is GNN-based models that employ a GNN to exploit structural information in KGs, including the random-walk-based homogeneous network embedding model Node2vec (Grover and Leskovec, 2016), the multi-relational model CompGCN (Vashishth et al., 2020), and the SOTA approach SLiCE (Wang et al., 2021a) with subgraph-based contextualization.
Implementation Details. We implement SMiLE with PyTorch and adopt Adam as the optimizer, with a learning rate of 1e-4 for the pre-training phase and 1e-3 for the fine-tuning phase. Models are trained on NVIDIA TITAN V GPUs. We utilize the random walk approach to generate context subgraphs, and the embedding dimension is set to 128. The temperature τ is initialized to 0.8 and the number of contextual translation layers k is set to 4. The maximum number of negative samples at both levels is set to 512. The GNN model f_e(·; G) is implemented by Node2vec or CompGCN. Please see Appendix A.3 for more details.
Evaluation Protocol. We evaluate the performance of SMiLE on the link prediction task. We regard the following two measurements as the evaluation metrics (Wang et al., 2021a; Shen et al., 2021) of prediction performance: (1) Micro-F1 score; (2) AUC-ROC score.
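For reference, the two evaluation metrics can be computed as in the following sketch (the 0.5 decision threshold for micro-F1 is an illustrative choice, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(scores, labels, threshold=0.5):
    """Link-prediction evaluation sketch: micro-F1 on thresholded triple scores
    and AUC-ROC on the raw scores."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return {"micro_f1": f1_score(labels, preds, average="micro"),
            "auc_roc": roc_auc_score(labels, scores)}

print(evaluate([0.9, 0.8, 0.2, 0.4], [1, 1, 0, 0]))
```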

Main Results
We compare our proposed SMiLE with various state-of-the-art models, and the experimental results are summarized in Table 2. We reuse the results on FB15k-237 reported by Wang et al. (2021a) for TransE, CompGCN and SLiCE. Clearly, we can observe that our proposed model SMiLE obtains competitive results compared with the baselines. Specifically, SMiLE performs better than the relation-based method CompGCN, which only models the relational connection within a triple, emphasizing that the contextual information learned from the context schema is more effective for link prediction. Furthermore, SMiLE outperforms the state-of-the-art baseline SLiCE (which shares the same backbone with SMiLE but is free of the schema context and the global correlations between contexts) by a large margin on the FB15k, JF17k and HumanWiki datasets, but marginally lags behind on FB15k-237.
Compared to the other datasets, the graph in FB15k-237 is much denser, as the average degree of each entity is larger. In this case, models are more dependent on generalizable logic rules for KG inference. As SLiCE automatically learns meta-paths from contexts, this is quite helpful for link prediction. Besides, the FB15k-237 dataset is reported to contain plenty of unpredictable links (Cao et al., 2021). Hence the unsatisfactory result of SMiLE on this dataset is reasonable.

Ablation Study
We consider two ablated variants (contextual-level and global-level contrastive learning) of our model in the ablation study. The experimental results on the FB15k and HumanWiki datasets are reported in Table 3. We can observe that the full model (the third row) outperforms both single-component variants by a large margin on both micro-F1 and AUC-ROC scores, further certifying that the semantic and structural information at the contextual and global levels both contribute substantially to SMiLE.
Moreover, we have an interesting observation that global-level information contributes more to the performance of the full model on FB15k than on HumanWiki. We believe this is because the graph of the FB15k dataset is much denser: each entity has a larger degree on average and naturally gains more information from its local neighbors. Under this circumstance, global-level information can be a significant boost to KG embeddings.

Impact of Negative Samples
In SMiLE, we adopt a contrastive learning strategy in the pre-training phase, which relies on the quality of negative samples. To verify whether our schema-guided sampling strategy obtains harder negatives, we compare it with a simpler relation-level sampling strategy, which randomly corrupts h or t in a positive triple (s, r, o) under an entity-type constraint, e.g., replacing the tail o with an entity v of the same type such that (s, *, v) does not appear in the KG, where * means that there is no relation directly connecting entity s and entity v.
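A small sketch of this relation-level sampling strategy (illustrative: it corrupts only the tail entity, and the relation_level_negatives helper is hypothetical) might look like:

```python
import random

def relation_level_negatives(triple, entities, entity_types, all_triples, k=5):
    """Corrupt the tail of (s, r, o) with an entity of the same type that is not
    directly connected to s (the '*' constraint described above)."""
    s, r, o = triple
    connected = {t for (h, _, t) in all_triples if h == s}
    candidates = [e for e in entities
                  if entity_types[e] & entity_types[o] and e not in connected and e != o]
    return [(s, r, e) for e in random.sample(candidates, min(k, len(candidates)))]
```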
As shown in Table 4, switching the negative sampling method from the schema level to the relation level drops the micro-F1 score from 92.08% to 90.77% and the AUC-ROC score from 97.23% to 96.48% on average, although it still leads to competitive performance compared to other KGE baselines. It is evident that the relation-level sampling strategy only focuses on an individual triple with type constraints on each entity, ignoring the context information of an entity. To summarize, the proposed schema-guided negative sampling strategy is capable of sampling effective hard negatives compared with traditional vanilla ones.

Analysis on Discriminative Capacity
In this section, to further demonstrate the discriminative capacity of SMiLE on the link prediction task, we visualize the distribution of positive and negative triple scores computed by SMiLE and compare it with another GNN-based model, Node2vec. As shown in Figure 3, the Node2vec model in 3a and 3c cannot precisely discriminate positive triples from corrupted negative triples in the test set, as a large number of negative triples still obtain high scores. Conversely, SMiLE in 3b and 3d widens the margin between the score distributions of positive and negative triples, offering strong evidence that our model brings positive triples closer while pushing negative triples farther away.

Case Study
To examine the effectiveness and interpretability of our proposed model, we visualize entity embeddings in 6 different contexts. We randomly select 6 tail entities from the FB15k dataset, and for each tail entity we randomly sample 50 head entities that are connected to it via a relation. We visualize these entity embeddings computed with Node2vec and SMiLE, respectively.
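A sketch of this visualization procedure (with random embeddings standing in for the learned ones; function and file names are illustrative) could look like:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_context_clusters(embeddings, context_ids, out_path="contexts_tsne.png"):
    """Project head-entity embeddings to 2-D with t-SNE and color each point by the
    tail entity (context) it is attached to, mirroring the analysis in Figure 4."""
    points = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(embeddings))
    plt.figure(figsize=(6, 5))
    plt.scatter(points[:, 0], points[:, 1], c=context_ids, cmap="tab10", s=12)
    plt.savefig(out_path, dpi=200)

# Illustrative usage: 6 contexts x 50 head entities with random 128-d embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 128))
ctx = np.repeat(np.arange(6), 50)
plot_context_clusters(emb, ctx)
```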
As shown in Figure 4, Node2vec (Figure 4a) cannot distinctly separate entities in different contexts; in particular, there is some overlap between entities in the context Warner Bros. and those in the context London. Conversely, entities in different contexts are well separated when SMiLE is used as the encoder (Figure 4b). Moreover, the distance between entities within the same context is much smaller, while the distribution across different contexts is much wider. The smaller overlap among clusters demonstrates that the proposed SMiLE effectively models the contextual information of entities while pushing entities of different types further apart. More concretely, we list more details of the related type information in Table 5. We can observe that the entities of Warner Bros., California, African Americans and the relations among them make up a bigger context schema. It is evident that there exist some semantic connections about "America" between them, hence the distance among these clusters is smaller.

Conclusion
In this paper, we propose SMiLE, a schema-augmented multi-level contrastive learning framework for knowledge graph link prediction. We identify that the critical issue in conducting effective link prediction is how to model precise and consistent contextual information of entities in different contexts. We propose an approach to automatically extract the complete schema from a KG. To fully capture the contextual information of entities, we first sample two-level negatives and perform contrast estimation at the contextual and global levels, and then fine-tune the representations of entities and relations to learn subtler knowledge. Empirical experiments on four benchmark datasets demonstrate that our proposed model effectively captures specific contextual information and the correlations between different contexts of an entity.

Limitations
In this paper, we utilize the KG schema as a prior constraint to capture contextual information. However, there are several limitations in our method: 1) The construction of the schema relies on explicit type information of entities, which some KGs lack. A promising improvement is to recapitulate type semantics by utilizing the linguistic information of concepts and word embeddings to capture the similarity between entities. 2) The proposed negative sampling strategy may be time-consuming on large-scale KGs. For future work, a more effective way to incorporate schema contexts into both entity and relation representations is worth exploring.

B.2 Ablation on queued pre-batch negatives
To explore how much the dynamic queue of negatives contributes, we report the experimental results in Table 8. We can observe that combining pre-batch negatives in contrastive learning improves model performance.

B.3 Ablation on the pre-training phase

To demonstrate the effect of the pre-training phase on capturing the contextual knowledge of entities, we disable the pre-training phase of SMiLE (denoted as SMiLE w/o PT) and only conduct the fine-tuning phase for link prediction. We report the results in Table 9, and we can observe that without pre-trained entity embeddings carrying contextual knowledge, the performance of SMiLE decreases on both the FB15k and HumanWiki datasets.

B.4 Additional Results of Discriminative Capacity
To supplement the analysis in Section 4.5 and further demonstrate the discriminative capacity of our proposed SMiLE on link prediction, in Figure 6 we visualize the distribution of triple scores computed with the state-of-the-art GNN-based multi-relational model CompGCN (Vashishth et al., 2020).

Figure 1: An example of a KG fragment. Nicole Kidman has two types, Actress and Citizen, and each of them preserves different information in different contexts.

Figure 2: Overall illustration of the proposed SMiLE model: detailed framework of the SMiLE model (left) and a sketch map of the multi-level contrastive learning mechanism (right).

Figure 3: Histogram distribution of triple scores on FB15k and HumanWiki datasets.

Figure 4: The visualization of entity embeddings on the FB15k dataset using t-SNE (Van der Maaten and Hinton, 2008). Points in the same color indicate that they are head entities connected to the same tail entity via a relation.

Figure 6: Histogram distribution of triple scores computed with CompGCN and SMiLE respectively.

Table 1 :
Statistics of datasets used in this paper.

Table 2 :
Link prediction performance of our method (SMiLE) and recent models on the FB15k-237, FB15k, JF17k and HumanWiki datasets. The best results are in bold and the second best results are underlined.

Table 3 :
Ablation study results on FB15k, JF17k and HumanWiki datasets. The best results are in bold.

Table 4 :
Performance of SMiLE with different kinds of negative samples on FB15k-237 and HumanWiki datasets.

Table 5 :
Examples of representative entities on the test set of the FB15k dataset with their detailed type information. Types of Warner Bros., California, African Americans and relations among them make up a context schema.

Table 7 :
Performance of SMiLE trained with schemas at different scales on FB15k. Coverage Ratio indicates the ratio of filtered entity-typed triples to all candidate ones.

Table 8 :
Performance of SMiLE in "without queued pre-batch negatives" and "full" modes on FB15k and HumanWiki datasets.

Table 9 :
Performance of SMiLE in "without pre-train" and "full" modes on FB15k and HumanWiki datasets.