Contrastive Learning with Generated Representations for Inductive Knowledge Graph Embedding

With the evolution of Knowledge Graphs (KGs), new entities emerge that have not been seen before. Representation learning of KGs in such an inductive setting aims to capture structural patterns from existing entities and transfer them to new entities. However, the performance of existing methods on inductive KGs is limited by sparsity and implicit transfer. In this paper, we propose VMCL, a Contrastive Learning (CL) framework with a graph-guided Variational autoencoder on Meta-KGs in the inductive setting. We first propose representation generation to capture the encoded and generated representations of entities, where the generated variations can augment the representation space with complementary features. Then, we design two CL objectives that work across entities and meta-KGs to simulate the transfer mode. With extensive experiments we demonstrate that our proposed VMCL significantly outperforms previous state-of-the-art baselines.


Introduction
Knowledge Graphs (KGs) structure objective knowledge in the form of ("head entity", "relation", "tail entity") triples to express factual connections. Representation learning of KGs aims to learn implicit representations (embeddings) of entities and then apply the learned representations to knowledge-intensive tasks such as link prediction (Li et al., 2021) and question answering (Hao et al., 2017). Conventionally, KGs are embedded in a transductive setting with a fixed, predefined set of entities, under the assumption that entities to be tested are seen during training. In this transductive setting, the learned entity representations from a source KG can easily be applied to a target KG (Sun et al., 2019).
However, it is well recognized that KGs are not static; rather, they evolve over time and new KGs with a novel set of entities emerge. For KGs in such an inductive setting, the learned representations of existing (source) entities are not applicable to the new (target) entities (Chen et al., 2022). Thus, the problem of capturing structural patterns from existing source entities and transferring them to a new set of target entities is of practical importance and poses a new set of interesting research challenges. We provide empirical insight into the representation learning problem in the inductive setting with an example in Fig. 1. Entities in the target KG are new and differ from entities in the source KG. Still, there are similarities in structural patterns between the two KGs, e.g., entity 1 in the target KG may be a company like Facebook in the source KG. Representation learning of inductive KGs aims to capture the structural patterns from the source KG, then transfer them to help the learning for the target KG. Note that what is transferred is structural patterns, not the learned representations, because the entities in the target and source KGs are different.

Figure 1: Example of inductive KGs with a source KG and a target KG. Entities in the target KG are different from entities in the source KG. However, there are similarities in relational patterns, e.g., entity Facebook in the source KG has the relations Headquartered-in, Economic-sector and Member-of, and entity 1 in the target KG has similar relation patterns. So it can be deduced that entity 1 may be a company like Facebook.
Several efforts have been devoted to representation learning of inductive KGs. AMIE (Galárraga et al., 2013), RuleN (Meilicke et al., 2018), Neural-LP (Yang et al., 2017) and DRUM (Sadeghian et al., 2019) learn probabilistic, entity-independent logical rules from the source KG and apply such rules to the target KG. GraIL (Teru et al., 2020) and CoMPILE (Mai et al., 2021) learn relation prediction through subgraph extraction and graph neural networks independent of any specific entities, and generalize this ability to the target KG. Recently, MorsE (Chen et al., 2022) learns transferable, entity-independent meta knowledge via meta learning (Qin et al., 2023). Although these methods have shown promising results for representation learning of inductive KGs, their performance is affected by at least two factors. First, their capability is limited by the sparsity of KGs caused by missing links: in the four inductive KGs N1-N4 (Teru et al., 2020), 65%, 47%, 46% and 42% of entities have degree less than 3. Second, effective transfer is often limited by implicitness, e.g., MorsE uses meta learning to learn transfer patterns implicitly between source and target KGs, lacking explicit joint comparison across entities and KGs.
In this work, we propose VMCL, a Contrastive Learning (CL) framework with a graph-guided Variational autoencoder on Meta-KGs in the inductive setting. Fig. 2 gives the framework. We first propose representation generation to capture the encoded and generated representations of entities, where the generated variations can augment the representation space with complementary features (Fig. 2(1)). Then, we design two CL objectives that work across entities and meta-KGs to simulate the transfer mode (Fig. 2(2)). We perform extensive evaluation and theoretical derivation to understand the performance of VMCL. Empirical results and analysis show the superiority of VMCL over state-of-the-art baselines.
In summary, our key contributions are: • We propose a graph-guided variational autoencoder to augment entity representations, which provides complementary features to alleviate sparsity and lay the foundation for transfer.
• We explicitly simulate the transfer mode with CL across entities and meta-KGs.
• Extensive experiments show the superiority of VMCL over state-of-the-art baselines. We release the code and datasets at https://github.com/feiwangyuzhou/VMCL.
Related Work

Knowledge Graphs
Knowledge Graphs (KGs) structure objective facts to express potential connections between entities (Li et al., 2022b). In general, KGs are considered in a transductive setting where they remain static. Representation learning in this transductive setting aims to learn informative representations of entities from a source KG and apply the learned representations to a target KG, because the entities in the target KG also appear in the source KG. TransE (Bordes et al., 2013) and RotatE (Sun et al., 2019) design similarity scoring functions to mine semantic features of entities and relations. GGAE (Li et al., 2021) introduces attention networks (Huang et al., 2021), while R-GCN (Schlichtkrull et al., 2018) adapts GNNs to encode the features of entities and relations from k-hop neighbor structures. Although many methods have proven effective on various transductive KGs, they cannot handle tasks involving entities unseen during training.
As mentioned, KGs evolve over time and new KGs with new entities emerge. This produces inductive KGs, where the target KG is induced from the source KG with predefined relations. In the inductive setting, entities in the target KG can be different from the entities in the source KG. Representation learning of inductive KGs aims to capture structural patterns from the source KG, then transfer them to the target KG. It is more difficult than transductive representation learning, but more realistic and practical. Much work is devoted to the representation learning of inductive KGs with probabilistic logical rules (Galárraga et al., 2013; Meilicke et al., 2018; Yang et al., 2017; Sadeghian et al., 2019), relation prediction ability (Teru et al., 2020; Mai et al., 2021) and meta-learning frameworks (Chen et al., 2022).
However, their feature mining and transfer capability is limited by sparsity and implicit transfer.
In contrast, we propose representation generation to augment the representation space, and design two CL objectives that work across entities and meta-KGs to simulate the transfer mode.

Contrastive Learning
CL aims to learn effective representations by pulling semantically close neighbors together and pushing non-neighbors apart, and has achieved great success in vision (He et al., 2020), text (Gao et al., 2021) and graphs (Zhu et al., 2021). CL on graphs improves performance by leveraging a contrastive loss at the node (Velickovic et al., 2019) and graph (Sun et al., 2020) levels. Hassani and Ahmadi (2020) and Zhu et al. (2021) contrast multiple structural views of graphs. Ahrabian et al. (2020), Wang et al. (2022) and Li et al. (2022a) introduce CL into transductive KGs by designing negative sampling strategies. Although much CL work concerns graphs and transductive KGs, relatively little focuses on inductive KGs. In contrast, we propose a simple but effective CL framework for inductive KGs to transfer knowledge from source KGs to target KGs.

Variational AutoEncoder (VAE)
Deep generative models have recently attracted much attention because they can generate unseen samples with the same distribution as the original data (Kipf and Welling, 2016; Simonovsky and Komodakis, 2018). The generated data, as a supplement to the original data, can help the model better mine latent features and make it more robust. VAE (Kingma and Welling, 2014) is an unsupervised generative framework and has been extensively studied and applied in various tasks such as question answering (Zhang et al., 2018) and graph autoencoding (Ahn and Kim, 2021; Li et al., 2023). In this paper, we introduce a graph-guided VAE to generate similar representations of entities, which broadens the representation space with complementary features.

Inductive KGs
We denote a KG as a graph G=(E, R), where E is the set of entities and R the set of relations. A KG organizes entities and relations as triples (h, r, t), where h, t ∈ E are the head and tail entities and r ∈ R is the relation between them. A group of inductive KGs includes at least a source KG G_S=(E_S, R_S) and a target KG G_T=(E_T, R_T) with the conditions R_T ⊆ R_S and E_T ∩ E_S = ∅. The goal of representation learning for inductive KGs is to capture the structural patterns f from G_S and transfer them to G_T. We expect f to map entities in G_T into representations with the following property: the score of a true triple (h, r, t) ∈ G_T should be higher than that of any false triple (h⁻, r⁻, t⁻).
In general, target KGs (1) may retain some old entities, e.g., character relationships, or (2) may not retain any, e.g., new events (though there are similarities in the propagation modes of different events). Case (2) is more difficult than case (1). Our definition is designed for case (2), and can easily be applied to case (1) by initializing old entities with their old embeddings.
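The formal conditions above (R_T ⊆ R_S and E_T ∩ E_S = ∅) can be checked mechanically. The following is a minimal, illustrative sketch (all data and helper names are hypothetical, not the paper's code):

```python
# Sketch of the inductive KG setting: a KG is a set of (head, relation, tail)
# triples; the target KG shares relations with the source KG but its entity
# set is disjoint from the source's.

def entities(triples):
    """Collect the entity set E of a KG given as (h, r, t) triples."""
    return {h for h, _, _ in triples} | {t for _, _, t in triples}

def relations(triples):
    """Collect the relation set R of a KG."""
    return {r for _, r, _ in triples}

def is_inductive_pair(source, target):
    """Check R_T subset-of R_S and E_T disjoint from E_S."""
    return (relations(target) <= relations(source)
            and not (entities(target) & entities(source)))

source_kg = [("Facebook", "Headquartered-in", "MenloPark"),
             ("Facebook", "Member-of", "NASDAQ")]
target_kg = [("e1", "Headquartered-in", "e2"),
             ("e1", "Member-of", "e3")]
print(is_inductive_pair(source_kg, target_kg))  # True: shared relations, disjoint entities
```

This also makes the Fig. 1 intuition concrete: entity "e1" shares no identity with "Facebook", only its relational pattern.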

Meta-KGs
Following MorsE (Chen et al., 2022), we use meta-KGs to simulate the transfer learning mode, where each meta-KG M_i is a subgraph sampled randomly from the source KG G_S. We re-label the ids of all entities in M_i to make the process label-insensitive, as our main interest is to capture structural patterns. We extract multiple meta-KGs, with |M| being the number of meta-KGs sampled; any two meta-KGs are sampled and re-labeled independently.

Representation Notations
In the remainder, we use the following notation: • For a meta-KG M_i, we follow (Chen et al., 2022) and initialize the representation of each entity with its relation patterns. Formally, the representation of entity h in M_i is initialized with features of the relations it is involved in, where O(h) = {r | ∃x, (h, r, x) ∈ M_i} and I(h) = {r | ∃x, (x, r, h) ∈ M_i} denote the outgoing and incoming relations of entity h.
• For a meta-KG M_i, we use h and h_g to denote the encoded representation and generated representation of entity h, respectively.
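Since the initialization equation itself is elided above, the following sketch shows one plausible instantiation: an entity's initial representation averages the head-role embeddings of its outgoing relations O(h) and the tail-role embeddings of its incoming relations I(h). The exact aggregation in the paper (Eq. 1) may differ; this is illustrative only.

```python
# Hedged sketch of relation-pattern initialization (an assumption, not the
# paper's exact Eq. 1): average r_head over O(h) and r_tail over I(h).

def init_entity(h, triples, r_head, r_tail, dim=4):
    """Initialize entity h's representation from its relation patterns.
    r_head / r_tail map each relation to its head-role / tail-role embedding."""
    out_rels = {r for (x, r, _) in triples if x == h}   # O(h)
    in_rels = {r for (_, r, x) in triples if x == h}    # I(h)
    vecs = [r_head[r] for r in out_rels] + [r_tail[r] for r in in_rels]
    if not vecs:                       # isolated entity: zero vector
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Because only relation embeddings are consulted, the same code applies unchanged to unseen target entities, which is the point of entity-independent initialization.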

Method
Fig. 2 depicts the overall framework of our VMCL.
In the following, we first introduce representation generation, which captures the encoded and generated representations of entities in each meta-KG. Then, we design two CL objectives that work across entities and meta-KGs to simulate the transfer mode.
Training Procedure is described at the end.

Representation Generation
With the initialized representations of entities, we augment the representation space with variants that are similar to, but still different from, the initialized ones (Fig. 2(1)). We design a graph-guided variational autoencoder to generate such variations because it provides finer-grained control (through the prior) to generate unseen representations with the same distribution as the original data (Kipf and Welling, 2016). Its architecture includes an encoder and a decoder: the encoder learns the encoded representation h and a latent Gaussian distribution N(μ_h, σ_h² I), and the decoder generates the representation h_g by sampling from this latent Gaussian prior.
The encoder first captures graph features and then learns a latent Gaussian distribution. We use a graph neural network (GNN) to modulate the representations of entities with their multi-hop neighborhood structures, aggregating over the set of (head entity, relation) pairs of the immediate incoming neighbor triples of entity h. Here W_r^l ∈ R^{D_l × D_{l−1}} is the relation-specific transformation matrix for relation r in the l-th layer, W_0^l ∈ R^{D_l × D_{l−1}} is a self-loop transformation matrix for entity h in the l-th layer, D_l and D_{l−1} are the dimensions of the l-th and (l−1)-th layers, and σ is a ReLU activation function. We use L GNN layers to capture graph features, where h_e^0 = h^0 with D_0 = D (Eq. (1)) and {D_0, ..., D_L} is the list of layer dimensions. We then use h_e^L with dimension D_L to learn the latent Gaussian distribution, where W_μ and W_σ are the weight matrices for the mean and variance of the (latent) Gaussian distribution, respectively. The decoder generates the representation h_g for entity h. We use the reparametrization trick to sample z from the latent Gaussian distribution, z = μ_h + σ_h ⊙ ε, where ε ∼ N(0, I) (a standard normal distribution) and ⊙ denotes element-wise multiplication. We then (re)construct h_g from z using another L-layer GNN whose dimension list {D_L, ..., D_0} is the reverse of that in Eq. (3); this process runs from layer L down to layer 0, with h_d^L = z, and we let h_g = h_d^0 denote the generated representation of entity h.
We optimize the parameters with a combined loss of reconstruction and KL divergence. During inference, we encode a representation h and generate a representation h_g for each entity h in M_i. To distinguish the two, we use M_i^g to denote the generated meta-KG, which has the same structure as M_i but different representations, i.e., the representation of entity h in M_i is the encoded representation h, and that of entity h_g in M_i^g is the corresponding generated representation h_g. In summary, the generated representation variations augment the representation space with complementary features, which are later used in the CL objectives.
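The reparametrization and KL terms at the heart of the generator can be sketched independently of the GNN layers. This is a generic VAE sketch under standard assumptions (diagonal Gaussian posterior, standard-normal prior), not the paper's implementation:

```python
import math
import random

def reparameterize(mu, log_var, rng=random.Random(0)):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so sampling stays differentiable with respect to mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2 I) || N(0, I) ), the regularizer added to the
    reconstruction loss when optimizing the autoencoder."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

Note that the KL term is zero exactly when the posterior matches the prior (mu = 0, sigma = 1), which is why minimizing it keeps the generated h_g distributionally close to the encoded h.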

Transfer with Contrastive Learning
With the support of the encoded representation h and the generated representation h_g, we design two CL objectives to simulate the transfer mode across entities and meta-KGs (Fig. 2(2)).
Transfer across Entities focuses on enhancing transferability across entities. We design a CL objective that works across entities inside a meta-KG, where a triple P=(h, r, t) in the meta-KG M_i serves as a positive sample and is contrasted with negative samples N_I. Here h_I = h + h_g and t_I = t + t_g are the representations of the head and tail entities, computed as the sum of the encoded and generated representations; the head and tail entities of a negative sample use the same summation. For a relation, we use its initial representation r. τ is a temperature hyperparameter and β(h_I, r, t_I) is a similarity score which can be taken from any Knowledge Graph Embedding (KGE) method, such as TransE (−||h_I + r − t_I||) (Bordes et al., 2013) or RotatE (−||h_I ∘ r − t_I||) (Sun et al., 2019). For a positive sample P=(h, r, t), negative samples are generated as follows. N_h is a set of U negative samples generated by replacing the head entity of (h, r, t) with negative "head" entities h_j⁻ sampled from the candidate entity list E_{M_i} − E_{M_i}(r, t), where E_{M_i}(r, t) is the list of true head entities (i.e., h_j ∈ E_{M_i}(r, t) iff (h_j, r, t) ∈ M_i). Similarly, N_t is a set of negative samples generated by replacing the tail entity of (h, r, t) with negative "tail" entities t_j⁻ sampled from E_{M_i} − E_{M_i}(h, r), where E_{M_i}(h, r) is the list of true tail entities (i.e., t_j ∈ E_{M_i}(h, r) iff (h, r, t_j) ∈ M_i). Likewise, N_r is a set of negative samples generated by replacing the relation of (h, r, t) with negative relations r_j⁻ sampled from the candidate relation list R_S − R_{M_i}(h, t), where R_{M_i}(h, t) is the list of relations r_j satisfying (h, r_j, t) ∈ M_i. To analyse how this CL loss affects the transfer across entities, we perform gradient analysis. In the gradients with respect to the head entity (with normalization constant ψ; see Appendix), the head entity h is closely related to the positive entity t, the negative entities t_j⁻ ∈ N_t and the relations r_j⁻ ∈ N_r. Thus, this CL objective can transfer features from positive and negative samples to head entities. The gradients with respect to the tail entity t are closely related to the positive entity h, the negative entities h_j⁻ ∈ N_h and the relations r_j⁻ ∈ N_r (see Appendix), which shows the transferability from positive and negative samples to tail entities.
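As an illustration of the entity-level objective, the following sketch combines a TransE-style score β with an InfoNCE-style softmax over one positive and its negatives. The exact loss form and negative weighting in the paper may differ; this is a minimal sketch under those assumptions.

```python
import math

def transe_score(h, r, t):
    """beta(h, r, t) = -||h + r - t||, the TransE similarity score."""
    return -math.sqrt(sum((hi + ri - ti) ** 2
                          for hi, ri, ti in zip(h, r, t)))

def contrastive_loss(pos, negs, tau=0.05):
    """InfoNCE-style objective: push the positive triple's score above the
    negatives'. pos is (h, r, t); negs is a list of corrupted triples."""
    scores = [transe_score(*pos)] + [transe_score(*n) for n in negs]
    exps = [math.exp(s / tau) for s in scores]
    return -math.log(exps[0] / sum(exps))
```

The loss decreases as the positive score rises relative to the negatives, which mirrors the gradient analysis above: features of the contrasted samples flow into the entity representations.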
Transfer across Meta-KGs aims to learn transferability across meta-KGs. Since the generated representation h_g in M_i^g has the same distribution as the encoded representation h in M_i, we add a relation r_g to create a link between the two meta-KGs M_i and M_i^g. Specifically, for entity h in M_i, we take h_g in M_i^g as the positive sample and use entities in other meta-KGs {M_k}_{k=0}^{U_a} as negative samples, and design a CL objective that works across meta-KGs to simulate the transfer mode (Eq. 14). Analogously to the positive sample, each negative sample is formed as (h, r_g, t_{k,j}⁻) by adding r_g between M_i and M_k, where t_{k,j}⁻ ∈ M_k denotes the j-th negative entity sampled from M_k with representation t_{k,j}⁻, and {t_{k,j}⁻}_g ∈ M_k^g denotes the j-th negative entity sampled from M_k^g with representation {t_{k,j}⁻}_g. We sample U negative entities randomly from both M_k and M_k^g, and sample U_a negative meta-KGs from the same batch.
To analyse how this CL loss affects the transfer across meta-KGs, we again perform gradient analysis. In the gradients −∂L_A(h, r_g, h_g)/∂h with respect to the head entity, ψ_g is a normalization constant (see Appendix). According to the gradients, this CL objective transfers positive features from h_g ∈ M_i^g and negative features from t_{k,j}⁻ ∈ M_k to the head entity h. In summary, these CL objectives, simulating the transfer mode across entities and multiple KGs, enhance transferability explicitly and help the model capture transferable structural patterns. Moreover, our CL objectives are relatively independent of the representation generation module: although we use the generated representations as contrastive objects, random, initial or word embeddings could serve as contrastive objects for other models and tasks.

Training Procedure
The PreTrain (PT) stage is trained on multiple meta-KGs (extracted from the source KG) with the combination of the generation loss (L_G), the CL losses (L_I, L_A) and the task-specific loss (L_ζ), where the weights η_1 and η_2 bring the different losses to the same order of magnitude. We use link prediction as the downstream task to optimize the parameters.
Here h_I, r, t_I are the representations of h, r, t (as in Eq. (9)); ζ(n) is a self-adversarial negative sampling function which assigns different weights to different negative samples according to their importance to the triple (h, r, t); ϑ is the sigmoid function and λ is a fixed margin. FineTune (FT): with the pretrained parameters (from the source KG) as initialization, the FT stage finetunes on the target KG with the task-specific loss.
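The self-adversarial weighting ζ(n) can be sketched as a temperature-scaled softmax over negative-sample scores, following the RotatE-style formulation this loss builds on; the temperature name τ_ζ and exact normalization are assumptions here.

```python
import math

def self_adversarial_weights(neg_scores, tau=1.0):
    """Sketch of zeta(n): softmax over negative-sample scores, so harder
    negatives (higher score, i.e. more plausible) receive larger weight in
    the task-specific loss, as in RotatE's self-adversarial sampling."""
    exps = [math.exp(s / tau) for s in neg_scores]
    z = sum(exps)
    return [e / z for e in exps]
```

In practice these weights are treated as constants (no gradient flows through them), so they only re-balance the contribution of each negative sample.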
Obtaining meta-KGs. Our model uses meta-KGs to simulate the inductive setting, where a meta-KG M_i is sampled from a source KG G_S with the following steps (Fig. 2(1)): (1) Sample an entity from the entity list E_{G_S} of the source KG G_S and put it into the set E_{M_i}. (2) Sample an entity from E_{M_i}, walk randomly n_1 times with length n_2, and put the walked entities into E_{M_i}. (3) Repeat step (2) n_3 times and use the entities in E_{M_i} to induce a meta-KG M_i. (4) Anonymize the entities in M_i by re-labeling their ids as {1, 2, ..., |E_{M_i}|} in an arbitrary order.
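The four sampling steps can be sketched as follows. This is a simplified, hypothetical implementation (e.g., walks here ignore edge direction and the relabeling order is sorted rather than arbitrary); the paper's sampler may differ in such details.

```python
import random

def sample_meta_kg(triples, n1=2, n2=3, n3=2, rng=random.Random(0)):
    """Sketch of meta-KG sampling: seed entity (1), random walks (2),
    repeated n3 times (3), induced subgraph with anonymized ids (4)."""
    adj = {}
    for h, r, t in triples:                      # undirected adjacency
        adj.setdefault(h, []).append(t)
        adj.setdefault(t, []).append(h)
    ents = {rng.choice(sorted(adj))}             # step (1): seed entity
    for _ in range(n3):                          # step (3): repeat
        for _ in range(n1):                      # step (2): n1 walks...
            cur = rng.choice(sorted(ents))
            for _ in range(n2):                  # ...of length n2
                cur = rng.choice(adj[cur])
                ents.add(cur)
    relabel = {e: i + 1 for i, e in enumerate(sorted(ents))}  # step (4)
    return [(relabel[h], r, relabel[t]) for h, r, t in triples
            if h in ents and t in ents]
```

The anonymization step is what makes the process label-insensitive: only relational structure survives into M_i.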

Evaluation Metrics and Baselines
We report MRR and H@N scores to evaluate link prediction performance on the test set of the target KG; the results are averaged over head and tail predictions. Following the settings in (Chen et al., 2022), the results of the baselines and VMCL are approximated by ranking each test triple among 50 other randomly sampled negative triples, repeated five times.
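The ranking protocol (rank each test triple among 50 sampled negatives, then average) reduces to the following computation. This is a generic sketch of MRR and Hits@N, not the authors' evaluation code:

```python
def rank_of_positive(pos_score, neg_scores):
    """Rank of the true triple among sampled negatives (1 = best):
    one plus the number of negatives scored strictly higher."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def mrr_and_hits(ranks, n=10):
    """MRR = mean reciprocal rank; Hits@N = fraction of ranks <= N."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= n) / len(ranks)
    return mrr, hits
```

With 50 negatives per test triple, the worst possible rank is 51, so these approximated scores are optimistic relative to ranking against the full entity set.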
To show the effectiveness of our approach, we compare VMCL against several strong baselines: • One Stage models are applied directly to the target KG after being trained on the source KG; they have no pretraining-finetuning steps. This includes RuleN (Meilicke et al., 2018), Neural-LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), GraIL (Teru et al., 2020) and CoMPILE (Mai et al., 2021).
• PT+FT models are pretrained on the source KG and then finetuned on the target KG. This includes variants of MorsE (Chen et al., 2022) with different KGEs (TransE/RotatE). To the best of our knowledge, these MorsE models are the state-of-the-art baselines.
VMCL is a PT+FT model. We design variants of VMCL with different KGEs (TransE/RotatE) and training settings (Full/-PT/-FT). Full is finetuned on the target KG with pretrained parameters from the source KG and then uses the finetuned parameters to test on the target KG; the -PT variant is trained on the target KG from random parameters; the -FT variant is trained on the source KG and then uses the pretrained parameters directly to test on the target KG. Please refer to Appendix A.3 for the settings and hyperparameters.

Results
In this section, we conduct extensive experiments to show the effectiveness of VMCL.
• VMCL over Baselines. The results of VMCL and the baselines are shown in Table 2 (F1-F4 datasets) and Table 3 (N1-N4 datasets); the default KGE model of MorsE and VMCL is RotatE. VMCL achieves significant improvements over the baselines. The one-stage baselines perform poorly because they only train on the source KG and lack adaptive training on the target KG, while MorsE does better as it also finetunes on the target KG. Our VMCL, which uses generated representations to provide complementary features and uses CL to enhance the transfer across entities and KGs, outperforms MorsE by a large margin. Specifically, the H@1 score of VMCL increases by 1.36, 5.97, 1.22 and 2.57 points on the N1, N2, N3 and N4 datasets, respectively (Table 3). We also find that VMCL performs better on sparse KGs (F1-F3, N1-N4) and does not have a significant effect on dense KGs (F4), which aligns with our motivation of alleviating sparsity. These significant improvements over the baselines demonstrate the effectiveness of our model.
• VMCL over PT/FT Stages. We report the results of VMCL under the Full/-PT/-FT settings in Fig. 3. First, we show the necessity of the PT stage by comparing Full with -PT. Full outperforms -PT by sizeable margins on F1-F4 and N2-N4 and by a competitive margin on N1. Concretely, the H@1 score of Full increases by 22.83 on F1, 12.31 on F2, 7.86 on F3, 6.27 on F4, 1.06 on N1, 10.52 on N2, 5.60 on N3, and 6.61 on N4. The improvement of Full over -PT indicates the transfer capability of the PT stage in VMCL: the structural patterns captured by the PT stage from the source KG help the representation learning of the target KG achieve better results. Second, we show the necessity of the FT stage by comparing Full with -FT. Full is better than -FT on all datasets. Notably, the H@1 score of Full increases by 9.93 points on F1, 5.35 on F2, 7.53 on F3, 10.79 on F4, 0.86 on N1, 29.85 on N2, 25.97 on N3, and 43.50 on N4. These results show that although the pretrained parameters from the source KG bring performance improvements, it is necessary to finetune the parameters on the target KG to adapt to its own characteristics. Because the number of triples in a meta-KG M_i is small relative to the source KG, the computation is in fact very fast.

Conclusion
In

Limitations
The representation dimension (default 32) is important but limited by our GPU resources; with larger GPUs, a larger dimension (e.g., 512) might achieve better performance. We also attempted to extend the inductive setting (where relational patterns are shared between the source and target KGs) to an independent setting (where the source KG is independent of the target KG), but the experimental performance was poor. That is, if the two KGs are irrelevant to each other, it may be impossible to transfer information. We also investigate an ablation study (Table 6) on the N1-N4 datasets: the representation generation module improves performance on N2-N4, and the CL modules (across entities and meta-KGs) play a crucial role in improving performance on N1.

A.3 Settings and Hyperparameters
The results of the baselines marked with * are taken from (Chen et al., 2022), while the results of MorsE and its variants (TransE/DistMult/ComplEx/RotatE) are reproduced with the publicly available code and optimal hyperparameter values. The reproduced baselines and our VMCL are implemented in PyTorch and DGL on a single GeForce RTX 2080 GPU. We use the same settings as the baselines: walk times n_1 = 10, walk length n_2 = 5, repeat times n_3 = 10, number of negative entities U = 32 (PT) / 64 (FT), dimension D = 32, and fixed margin λ = 10. The optimizer is Adam with learning rates 0.01/0.001 and 10/100 epochs for the pretraining/finetuning stages. For our VMCL, the dimension D_L is set to 128 and D_1 = ... = D_{L−1} = D. The number of encoder layers L is selected from {1, 3, 5}, and the best L is 3. The number U_a of negative meta-KGs is selected from {0, 1, 4, 8}, and the best U_a is 4. The temperatures τ and τ_ζ are selected from {0.01, 0.05, 1}; the best τ is 0.05 and the best τ_ζ is 1. The loss weights η_1 and η_2 are selected from {.002, .001, .0005}; the best η_1 is .001 and the best η_2 is 1.

Figure 2 :
Figure 2: The overall framework of VMCL. (1) For a given meta-KG M_i, Representation Generation uses a graph-guided variational autoencoder to generate representation variations (denoted by M_i^g) of the entities. (2) Transfer with Contrastive Learning simulates the transfer mode explicitly with CL across entities and meta-KGs.
• R ∈ R^{|R_S|×D} denotes learnable relation representations which preserve the internal semantics of relations; r ∈ R is the representation of relation r, and D is the dimension of the representations. • R_head ∈ R^{|R_S|×D} and R_tail ∈ R^{|R_S|×D} are learnable relation embeddings used to initialize the representations of the respective entities. For a relation r, r_head ∈ R_head denotes the relation embedding for the head entities connected by r, and r_tail ∈ R_tail denotes the relation embedding for the tail entities connected by r.

Table 1 :
Dataset statistics. Rel and Ent show the number of relations and entities in the source/target KG. Train, Valid and Test show the number of triples in the train, valid and test sets of the source/target KG.

Table 2 :
Results of VMCL and baselines on the F1-F4 datasets, where the results with * are taken from (Chen et al., 2022). Bold numbers denote the best results. VMCL is significantly better than MorsE with p-value=.028.

Table 3 :
Results of VMCL and baselines on the N1-N4 datasets, where the results with * are taken from (Chen et al., 2022). Bold numbers denote the best results. VMCL is significantly better than MorsE with p-value=.006.

Table 4 :
Results of different KGE models.