Knowledge Representation Learning with Contrastive Completion Coding

Knowledge representation learning (KRL) has been used in plenty of knowledge-driven tasks. Despite fruitful progress, existing methods still suffer from immaturity in tackling potentially imperfect knowledge graphs and highly imbalanced positive-negative instances during training, both of which hinder the performance of KRL. In this paper, we propose Contrastive Completion Coding (C3), a novel KRL framework composed of two functional components: 1. Hierarchical Architecture, which integrates both low-level standalone features and high-level topology-aware features to yield robust embeddings for each entity/relation. 2. Normalized Contrastive Training, which conducts normalized one-to-many contrastive learning to emphasize different negatives with different weights, delivering better convergence than conventional training losses. Extensive experiments on several benchmarks verify the efficacy of the two proposed techniques, and combining them generally achieves superior performance against state-of-the-art approaches.


Introduction
Knowledge graph (KG), as a well-structured representation of knowledge, plays an important role in a variety of knowledge-driven applications. Upon a KG, knowledge representation learning (KRL) aims to embed the high-dimensional and usually discrete features of entities/relations into a low-dimensional vector space. These learned representations, by encoding the underlying semantic relationships among entities/relations, are able to facilitate various downstream tasks, such as question answering (Bordes et al., 2014), recommendation (Wang et al., 2019b) and relation extraction (Bastos et al., 2021), to name a few. As a basic research topic, KRL has long attracted much attention from researchers in relevant domains.
Previous KRL methods generally consider KG completion (a.k.a. link prediction) as the learning goal. In particular, they define a certain score or energy function and accomplish training by pushing up the scores of the observed positive triples while simultaneously pushing down the scores of negative ones (Ahrabian et al., 2020). To further take KG connectivity into account, recent works take advantage of graph neural networks (GNNs) (Vashishth et al., 2019; Ye et al., 2019) to exploit graph topology in KRL (Dettmers et al., 2018). The GNN-based approaches have dominated the state-of-the-art performance on popular benchmarks.
Despite the fruitful progress they have achieved, existing methods still suffer from an immature ability to tackle incomplete/noisy KGs and imbalanced positive-negative pairs. Regarding the first issue, it is hard to construct a perfect KG in practice owing to the expensive annotation effort, let alone that the information in a KG is dynamically updated and it is difficult to detect changes at any time. In this situation, using a GNN to aggregate information among noisy instances will increase the spread of noise and harm knowledge representation learning.
In terms of the second issue, it is common that the number of negative instances is much greater than that of positive instances, and the importance of different negative instances differs greatly. Recalling the training losses in previous KRL methods (such as the margin-based (Chechik et al., 2009) and logistic-based (Gutmann and Hyvärinen, 2010) losses), they compare each positive instance with only one negative instance at each training iteration, treating all negatives equally. In this way, they not only restrain the interaction between positive and negative instances, but also overlook the different weights of different negative samples with respect to each positive instance, which, in general, leads to bias and slow training convergence. Taking the triple (Kobe Bryant, nationality, United States) for example, we replace the tail entity with others to generate a negative triple set, including (Kobe Bryant, nationality, Italy) and (Kobe Bryant, nationality, Michael Jordan). In fact, for the second triple, Michael Jordan is not even a nation name, and such a negative fact should be weighted less than others, such as the first triple.
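This tail-corruption procedure for generating negatives can be sketched as follows (a minimal illustration; the function name and candidate entity list are our own, not from the paper):

```python
def corrupt_tail(triple, entities):
    """Generate negative triples by replacing the tail entity
    with every other candidate entity."""
    h, r, t = triple
    return [(h, r, e) for e in entities if e != t]

negatives = corrupt_tail(
    ("Kobe Bryant", "nationality", "United States"),
    ["United States", "Italy", "Michael Jordan"],
)
# Both negatives are generated, but (Kobe Bryant, nationality,
# Michael Jordan) is trivially false and should be weighted less.
```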
To address both issues mentioned above, this paper proposes Contrastive Completion Coding (C3), a novel framework for robust and efficient KRL. C3 is mainly composed of two functional parts: 1. Hierarchical Architecture, which is designed to preserve mixed information from both low-level (embedding net) and high-level (GNN) features of each instance. By ensembling different levels of features, we can make full use of the topology structure via the GNN while effectively suppressing the dispersion of noise over an imperfect KG. 2. Normalized Contrastive Training, which maximizes the normalized probability of the positive instance over all potential candidates, which include more than one negative sample. In this manner, the importance of different negative triples is automatically reflected with regard to the positive instance during training. Indeed, this objective is also known as InfoNCE, a mutual information loss that has been applied widely in machine learning and computer vision (van den Oord et al., 2018; Hjelm et al., 2019; Chen et al., 2020).
We summarize our contributions as follows: • We propose hierarchical KRL to deal with representation learning on an imperfect KG. By integrating both low-level (embedding net) and high-level (GNN) features of each instance, C3 can exploit topology-aware message passing while suppressing noisy and invalid propagation through the GNN.
• We develop a Normalized One-to-Many Contrastive Objective to train the model on imbalanced positive-negative pairs. To be specific, we adopt InfoNCE, a mutual information loss, to attend to the varying importance of different negative samples, giving rise to more effective learning.
• Extensive experimental evaluations on two link prediction benchmarks, FB15k-237 and WN18RR, reveal that the two proposed techniques are effective and compatible with each other, and the proposed C 3 generally outperforms various state-of-the-art counterparts.

Related Work
Our work is closely related to two main branches of study: knowledge representation learning and contrastive losses.

Knowledge Representation Learning
Knowledge Representation Learning (KRL) is a widely studied field with pretext tasks such as KG completion. Traditionally, one line of research focuses on designing score or energy functions in margin-based models (Bordes et al., 2013; Wang et al., 2014; Ahrabian et al., 2020) or un-normalized probability models (Dettmers et al., 2018; Jiang et al., 2019; Balažević et al., 2019b). However, all of the above works adopt margin-based or logistic-based losses, which overlook the varying importance of different negative samples. In contrast, our C3 uses a new training strategy, InfoNCE, which casts negative sampling as a multiclass classification problem with a Soft-Max and cross-entropy loss. Since a KG is a special kind of graph-structured data, some works use a graph neural network (Schlichtkrull et al., 2018; Wang et al., 2019a; Ye et al., 2019; Vashishth et al., 2019) to extract the semantic structure information of the KG. In this work, we use an embedding network and a GNN to learn different levels of features for instances in the KG and preserve mutual information between the context and both of them.

Contrastive Loss
Contrastive losses measure the distance, or similarity, between representations in the latent space, which is one of the key differences between contrastive learning methods and other representation learning approaches (Le-Khac et al., 2020). Motivated by energy-based models (LeCun and Huang, 2005), Chopra et al. (2005) first introduce the original margin-based loss, which is then reformulated in (Hadsell et al., 2006) and generalised in later work (Chechik et al., 2010; Collobert and Weston, 2008; Weinberger and Saul, 2009). Another form of contrastive loss is the logistic-based loss (Gutmann and Hyvärinen, 2010), an estimation method for an un-normalised probabilistic model that avoids the need to evaluate the partition function through a proxy binary classification task.

Figure 1: Overview of the Contrastive Completion Coding framework. C3 is mainly composed of two functional parts: 1. Hierarchical Architecture, which is designed to preserve mixed information from both low-level (embedding net) and high-level (GNN) features of each instance. 2. Normalized Contrastive Training, which maximizes the normalized probability of the positive instance over all potential candidates, which include more than one negative sample. |V|: the number of entities, |R|: the number of relations, d: the dimension of representations. Light green and light blue denote low-level features; dark green and dark blue represent high-level features. The yellow vector is the representation of the context. In the score table, green denotes a positive score and red a negative score. Note the noisy triple (Juanita is Michael Jordan's ex-wife, not his current wife) in the upper right corner of the KG, which will be learned by the GNN and affect the quality of the knowledge representation.
van den Oord et al. (2018) prove that minimising this loss based on NCE is equivalent to maximising a lower bound on the mutual information. Chen et al. (2020) further elaborate on its advantages over other losses. To address the imbalance between positive triples and negative triples during KRL training, this normalized one-to-many training objective is also used in our model.

Contrastive Completion Coding
In this section, we first present the problem definition of our task, and then provide the details of our architecture and training strategy, illustrated in Figure 1.

Problem Definition
A Knowledge Graph is defined as G = (V, R, T), where V, R, T represent the set of entities, relations and triples, respectively. Each triple (h, r, t) ∈ T indicates the relation r ∈ R between the head entity h ∈ V and the tail entity t ∈ V. We usually assume that information can flow along both directions of every edge, so for each triple (h, r, t) ∈ T, its inverse triple (t, r^-1, h) is also included in G.
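The bidirectional augmentation above can be sketched as follows (a minimal illustration; the naming of the inverse relation is our own convention):

```python
def add_inverse_triples(triples):
    """For each (h, r, t), also include the inverse (t, r^-1, h)
    so information can flow along both edge directions."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + "^-1", h))
    return augmented

graph = add_inverse_triples([("h1", "r1", "t1"), ("h2", "r2", "t2")])
# The augmented graph contains twice as many triples as the original.
```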
KRL aims to represent the entities of a KG in a low-dimensional vector space {e_v(v) ∈ R^n | v ∈ V} and likewise the relations {e_r(r) ∈ R^n | r ∈ R}, where n denotes the representation dimension, and e_v(v) and e_r(r) represent the embeddings of an entity and a relation, respectively. To do so, KRL usually conducts the KG completion task (a.k.a. link prediction) as the pretext task. For example, in the case of tail entity inference, common KRL methods contend that the positive embedding should achieve a larger score than all negative embeddings, with respect to the context consisting of the head entity and relation. Formally, the KRL objective is

S(e_v(t^+), g(h, r)) > S(e_v(t^-), g(h, r)),  ∀ (h, r, t^-) ∉ G,    (1)

where g(·) represents the completion function that returns the representation of the context (h, r), and t^+ and t^- denote a positive and a negative instance, respectively. S(·) denotes the scoring function, which will be discussed in Section 3.2.

Hierarchical Architecture
As introduced before, embedding each entity and relation with an i.i.d. function omits the graph structure that is capable of characterizing high-order interactions. On the contrary, employing a GNN alone for embedding learning is vulnerable to an imperfect KG. For the sake of robust embeddings, this work combines both the low-level standalone features and the high-level topology-aware features. Specifically, we propose a GNN-based hierarchical encoding method. Intuitively, not theoretically, different levels of features can be regarded as different views of the context-instance: the low-level representation captures the feature view of each node instance, and the high-level representation characterizes the topology view of the whole KG.

Encoding Function. First, we define the low-level instance feature z_L obtained from the i.i.d. embedding network e_L(·) as follows:

z_L^v = e_L(v),  z_L^r = e_L(r),    (2)

where the superscripts v and r denote an entity and a relation, respectively. Then, the high-level instance feature z_H obtained from the graph-aware encoding function e_H(·; G) is defined below:

z_H^v = e_H(v; G),  z_H^r = e_H(r; G),    (3)

where e_H(·; G) is implemented by a specific GNN (Vashishth et al., 2019).

Completion Function. For inferring the missing part of a triple, the completion function is proposed to encode the context representation c:

c = g(z^v, z^r),    (4)
where the completion function g(·) can be implemented as any type of operation: addition, multiplication, decomposition, MLP, convolution, etc. We also define c_L and c_H as the context representation vectors generated by the completion function using (z_L^v, z_L^r) and (z_H^v, z_H^r), respectively.

Scoring Function. The scoring function S(·) measures the similarity or distance between two inputs. A trivial form of S(·) is given by an inner/dot product between two vectors, S(z, c) = z^T c; this is the most commonly used measurement in the literature (Dettmers et al., 2018; Vashishth et al., 2019). Another popular option is the cosine similarity, S(z, c) = z^T c / (‖z‖ ‖c‖), whose value is bounded between -1 and 1, and equal to 0 for orthogonal vectors. Unless otherwise specified, we adopt the cosine similarity in our method.
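The two scoring options can be sketched as follows (a minimal NumPy illustration, not the paper's implementation):

```python
import numpy as np

def dot_score(z, c):
    # unbounded inner-product similarity
    return float(z @ c)

def cosine_score(z, c):
    # bounded in [-1, 1]; 0 for orthogonal vectors
    return float(z @ c / (np.linalg.norm(z) * np.linalg.norm(c)))

z = np.array([1.0, 0.0])
c = np.array([0.0, 2.0])
print(cosine_score(z, c))  # orthogonal vectors score 0.0
```

Note that the cosine score is invariant to the magnitude of the embeddings, whereas the dot product is not.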
To allow hierarchical scoring, we contrast the context vector c with both the low-level feature z_L and the high-level feature z_H as a weighted combination:

S(z, c_H) = ρ S(z_L, c_H) + (1 − ρ) S(z_H, c_H),    (5)

where 0 ≤ ρ ≤ 1 is a hyper-parameter that controls the trade-off between the two levels. We will discuss this hyper-parameter in Section 4.5. The reason why we choose the combination in Eq. 5 to calculate the hierarchical score is mainly the consideration of computational efficiency and experimental effect, which we discuss in detail in Section 4.3. Despite its simplicity, our experiments support that such a simple linear combination is sufficient to provide the desired performance.
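Assuming the cosine scoring function adopted above, the weighted two-level combination of Eq. 5 can be sketched as (an illustrative stand-in, not the paper's code):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_score(z_L, z_H, c_H, rho=0.5):
    # S = rho * S(z_L, c_H) + (1 - rho) * S(z_H, c_H)
    return rho * cos(z_L, c_H) + (1 - rho) * cos(z_H, c_H)

z_L = np.array([1.0, 0.0])   # low-level (standalone) feature
z_H = np.array([0.0, 1.0])   # high-level (topology-aware) feature
c_H = np.array([1.0, 0.0])   # high-level context

print(hierarchical_score(z_L, z_H, c_H, rho=1.0))  # only low-level: 1.0
print(hierarchical_score(z_L, z_H, c_H, rho=0.0))  # only high-level: 0.0
```

At ρ = 1 or ρ = 0 the score degenerates to a single level; intermediate values mix the two views.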

Normalized Contrastive Training
With the scoring function at hand, the last step is to formulate a training objective that fulfils the ranking in Eq. 1. There exist two typical training losses: the margin-based method (Chechik et al., 2009) and the logistic-based method (Gutmann and Hyvärinen, 2010).
Specifically, the margin-based objective is given by

L_margin = max(0, γ − S(z^+, c) + S(z^-, c)),    (6)

as well as the gradient w.r.t. c (when the margin is violated):

∂L_margin/∂c = ∇_c S(z^-, c) − ∇_c S(z^+, c).    (7)

As for the logistic-based method, it considers a surrogate binary classification task using the logistic loss function. To be specific, it computes

L_logistic = −log σ(S(z^+, c)) − log σ(−S(z^-, c)),    (8)

along with the gradient w.r.t. c as follows:

∂L_logistic/∂c = −σ(−S(z^+, c)) ∇_c S(z^+, c) + σ(S(z^-, c)) ∇_c S(z^-, c),    (9)

where σ(·) is the Sigmoid function. Although these two kinds of losses have been applied widely in KRL, they contrast the one-to-one difference between the positive and negative instances, which is unable to handle the imbalance between positive and negative triples during training, given that the number of negatives is usually far greater than that of positives. In addition, by checking their gradients, the update directions of these two objectives relate to each instance separately, without further reflecting the different importance of different negatives. Inspired by (Ahrabian et al., 2020; Chen et al., 2020), it is essential to mine "hard" negative samples to avoid easy pairs that provide no substantial learning signal in any learning system.
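The two conventional losses above can be sketched on scalar scores as follows (a minimal illustration; the exact equation forms are reconstructions of the standard losses, since the original equations were not reproduced here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def margin_loss(s_pos, s_neg, gamma=1.0):
    # one positive vs. one negative per step; every violating
    # negative contributes to the gradient with the same unit weight
    return max(0.0, gamma - s_pos + s_neg)

def logistic_loss(s_pos, s_neg):
    # surrogate binary classification: positives -> 1, negatives -> 0
    return -np.log(sigmoid(s_pos)) - np.log(sigmoid(-s_neg))

print(margin_loss(2.0, 0.0))  # margin satisfied -> 0.0
print(margin_loss(0.5, 0.4))  # margin violated  -> 0.9
```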
In order to overcome this limitation, we propose to apply a normalized one-to-many training objective (one positive and many negatives at a time). In particular, we sample a candidate set Z = {z^+, z^-_1, . . . , z^-_{N−1}} with one positive instance but all possible negative samples in the objective function, leading to a total sample number of N. Different entities may have different numbers of negative samples, hence N varies. We then compute the score between each candidate and the context c. By applying a Soft-Max (Bishop, 2006; Goodfellow et al., 2016) on all scores, the training target is to maximize the normalized score of the positive instance, leading to

L_NCE = −log [ exp(S(z^+, c)) / Σ_{i=1}^{N} exp(S(z_i, c)) ],    (10)

with the gradient given by

∂L_NCE/∂c = −∇_c S(z^+, c) + Σ_{i=1}^{N} p_i ∇_c S(z_i, c),  where p_i = exp(S(z_i, c)) / Σ_{j=1}^{N} exp(S(z_j, c)).    (11)

From Eq. 11, we can see that the gradients of the negatives are no longer treated equally: each is weighted by its share of the sum of the exponentiated scores of all samples. If this term is large, the corresponding negative sample will greatly influence the gradient and the training process. This property is clearly different from the conventional gradient in Eq. 7, where the weights of all negative samples are the same (i.e., 1). In this way, training focuses more on the crucial negative samples with large relative weight, yielding better convergence. We compare its effectiveness with other training losses in the experiments in Section 4.5.
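The normalized one-to-many objective and its per-sample gradient weighting can be sketched as follows (a minimal NumPy illustration under the assumption of precomputed scalar scores; the temperature parameter τ is introduced below):

```python
import numpy as np

def info_nce(scores, pos_index=0, tau=1.0):
    """Soft-Max over all N candidate scores; the loss is the negative
    log-probability of the positive. The returned probabilities are
    exactly the weights each sample receives in the gradient."""
    logits = np.asarray(scores, dtype=float) / tau
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_index]), probs

# one positive (index 0) and three negatives of varying difficulty
loss, weights = info_nce([5.0, 4.5, 1.0, 0.5])
# the "hard" negative (score 4.5) receives a much larger weight than
# the easy ones, so it dominates the gradient update
```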
Note that Eq. 10 is also known as the InfoNCE loss, initially proposed in CPC (van den Oord et al., 2018). It is proved that InfoNCE is actually a lower bound of mutual information; in other words,

I(z; c) ≥ log N − L_NCE.    (12)

Following the normalized temperature-scaled cross-entropy (NT-Xent) loss (Chen et al., 2020), we also use a temperature parameter τ to control the sensitivity of the scoring function. Note that the temperature τ determines the attraction-repulsion radius around the context, and thus acts similarly to the margin γ in the margin-based loss.
In summary, the objective of C3 is derived by plugging the hierarchical score of Eq. 5 and the temperature τ into Eq. 10:

L_{C3} = −log [ exp(S(z^+, c_H)/τ) / Σ_{i=1}^{N} exp(S(z_i, c_H)/τ) ].    (13)

For better readability, we illustrate the flowchart of our method in Algorithm 1.

Baselines.
We compare our C3 with previous state-of-the-art KRL methods, including TransE (Bordes et al., 2013).

For all experiments, we adopt SGD with momentum as the optimizer to train our models. We use a cosine decay schedule (Loshchilov and Hutter, 2016; Chen et al., 2020) with the initial learning rate set to 1e-4, the momentum to 0.9, and the temperature τ to 0.07 (He et al., 2020). Unless otherwise specified, the trade-off hyper-parameter ρ is set to 0.5. The embedding net e_L(·) is a one-layer learnable embedding network, the GNN e_H(·) is the GCN model used in COMPGCN (Vashishth et al., 2019), and the completion function follows the implementation in ConvE (Dettmers et al., 2018).

Evaluation Metrics. We use the following measurements as our evaluation metrics: (1) Mean Reciprocal Rank (MRR); (2) Hits@10, Hits@3 and Hits@1, which indicate the proportion of correct answers ranked in the top 10, 3 and 1, respectively.

Table 1 shows the performance comparisons between our C3 models and SOTA models on the WN18RR and FB15k-237 datasets. The results of all SOTA methods are taken directly from the previous papers (Vashishth et al., 2019; Balažević et al., 2019b; Ahrabian et al., 2020). Clearly, in terms of the two most crucial metrics, MRR and H@10, our method improves over the baseline COMPGCN (which shares the same backbone with our method but is free of hierarchical embeddings and contrastive training) from 0.479 to 0.492 in MRR and from 0.546 to 0.572 in H@10, which validates the effectiveness of our two proposed contributions. Overall, to the best of our knowledge, C3 outperforms all existing methods on both datasets and achieves superior performance against state-of-the-art approaches.

Analysis of Hierarchical Structure
Different Context-Instance Training Strategies. We try all context-instance combinations to study all kinds of context-instance relationships in Table 2. We find that the method using both low-level features e_L and high-level features e_H of instances is better than the variants using either low-level or high-level features alone. In particular, the (MRR, H@10) of our results is (0.492, 0.572), while the counterparts with only high-level or only low-level features achieve (0.478, 0.565) and (0.458, 0.517), respectively, under the same context c_H. Interestingly, for the method using only low-level features e_L, we find that the context representation c_L performs better than c_H. When we use both features of instances, c_H outperforms c_L, which implies that the deeper c_H has more expressive capacity. The results support the hypothesis that leveraging low-level and high-level features captures different levels of contextual information, and is thus more capable of knowledge representation learning.
Hierarchical Scoring Function. We chose the hierarchical scoring function shown in Eq. 5 as discussed previously. In fact, it is possible to leverage both c_H and c_L in any combination, as shown in Table 3. We choose the combination (c_H for z_L, c_H for z_H) in Eq. 5 for the following two reasons: 1. From the perspective of computational efficiency, the first two combinations would double the FLOPs by computing two-level contexts. 2. As reported in Table 3, it achieves the best result, which suggests that the score between the high-level context and the features of both levels is sufficient to capture the multi-view patterns in the KG.

Quantitative Statistics on Different Levels Features.
We conduct experiments on the test set to verify the importance of the low-level and high-level features, respectively. Table 4 shows that the number of cases with S_H ≤ S_L is much larger than that with S_H > S_L when predicting entities on FB15k-237, while the low-level and high-level features of instances have almost equal effects on the prediction results on WN18RR. This may imply that the degree of incompleteness/noise varies greatly across datasets. The above two experimental results show that we can make full use of the topology structure via the GNN while effectively suppressing the dispersion of noise over an imperfect KG by ensembling different levels of features.

Hierarchical Structure for Noise Suppression
To validate the assumption that different levels of features help in proportion to the degree of incomplete/noisy information present in different datasets, we introduce 10% and 20% noise into the datasets following the principle in CKRL. The results on WN18RR are shown in Table 9. According to the results, the last column is better than the other columns. Hence, we further confirm that the method using both low-level features e_L and high-level features e_H of instances is better than using only low-level or only high-level features. In particular, as the noise increases, the gap between our method and the other methods grows, which supports the robustness of our method in preventing noise. For example, the MRR improvement between our C3 and the high-level baseline (COMPGCN) increases from 0.018 to 0.043 when the noise grows from 10% to 20%.

Contrastive Training
Loss Function. We evaluate the effects of C3 with different loss functions: the margin-based loss, the un-normalized logistic-based loss, and the normalized probability-based InfoNCE loss, as described above. The experimental results are shown in Figure 2. By observing the best MRR recorded on the validation set during training, we find that 1) the margin-based loss converges slowly and has poor performance; 2) the logistic-based loss converges slowly at first, but after a certain period of warming up, it exhibits a faster convergence speed; 3) with InfoNCE, both the convergence rate and the eventual performance far exceed the other two counterparts, demonstrating the rationality of our choice.

Analysis of Key Hyper-parameters. Table 6 shows the results when the hyper-parameter ρ in Eq. 5 takes different values. This hyper-parameter controls the trade-off between the two levels of features. We can see that the best value of ρ lies between 0.2 and 0.8 on both datasets. We empirically set ρ to 0.5 and find it works promisingly. Figure 3 records the impact of different representation dimensions. It is observed that, as the dimension increases, the performance of C3 improves gradually and steadily, while the growth rate decreases gradually. On the contrary, the results of other compared methods, such as COMPGCN, remain almost unchanged as the dimension varies. This supports that our C3 model benefits more from a larger representation dimension than its KRL counterparts, which may imply that our method better mines the representation capacity within the input graphs.
Analysis of Training Time. To better address imbalanced positive-negative pairs, increase the interaction between positive and negative instances, and approximate the lower bound of mutual information in Eq. 12, we sample as many negative instances as possible to better normalize the probability of the positive instance over all potential candidates. Nevertheless, sampling all negative instances, the same procedure applied in both COMPGCN and our method, occupies a very small proportion of the total computation. This is because the representations of all entities have already been obtained in memory when calculating InfoNCE, and the main calculation lies in the Soft-Max with respect to all negative representations, which costs little compared to the representation computations. For example, the sampling times for COMPGCN and our C3 are close, about 0.16s/iter and 0.21s/iter, respectively.

Conclusion
In this paper, we present C3, a novel knowledge representation learning framework that is mainly composed of two functional parts: 1) Hierarchical Architecture, which integrates both low-level standalone features and high-level topology-aware features and has also exhibited effectiveness in suppressing the spread of noise; and 2) Normalized Contrastive Training, which conducts normalized one-to-many contrastive learning to emphasize different negatives with different weights. Extensive experiments on several benchmarks verify the efficacy of both techniques.

A.2.2 Evaluation Metrics
In this paper, we conduct our experiments on the KG completion task. It concentrates on the quality of the knowledge representations (Socher et al., 2013) and aims to complete a triple whose head or tail entity is missing.
We use two measures as our evaluation metrics: (1) Mean Reciprocal Rank (MRR), a relative score that averages the inverse of the rank at which the first relevant entity is retrieved over a set of queries; and (2) Hits@10, Hits@3 and Hits@1, which indicate the proportion of correct answers ranked in the top 10, 3 and 1, respectively.
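Both metrics can be computed directly from the rank of the correct entity for each query; a minimal sketch (the example ranks are hypothetical):

```python
def mrr(ranks):
    # mean of the inverse rank of the first correct answer per query
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    # fraction of queries whose correct answer ranks in the top k
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 12, 2]        # hypothetical ranks for four queries
print(hits_at(ranks, 3))     # three of four queries rank in the top 3 -> 0.75
```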
For COMPGCN, which is closely related to our method, we have conducted the comparison in a fair and comprehensive setting to justify the significance of our proposed idea. For other methods (such as SANS), we tried to reproduce the results for all metrics but failed to obtain numbers comparable to those reported. Hence, a conservative solution is to directly copy the numbers from their papers.

A.2.3 Hyper-parameters
For selecting the best model, we perform a hyper-parameter search over the values listed in Table 8 using the validation data, selecting the highest MRR. In our best setting, we use a learnable convolution network, ConvE, as our completion function g(·). The best learning rate is lr = 0.09, the batch size is 128, the representation dimension is 500, the dropout is 0.1, and the composition operator is multiplication with a two-layer f_gnn for FB15k-237 (600 epochs) and circular-correlation with a one-layer f_gnn for WN18RR (800 epochs). Our C3 model is built on the PyTorch Geometric framework (compatible with Python 3.x). The total number of parameters of the C3 model is 64.613M, and the total number of FLOPs is 9.154G.

Table 12: Effects of the completion function and loss function. Experimental settings: representation dimension = 500, batch size = 128. The results in the first three rows show that the convolution completion function gives a substantial improvement over the others, and the last three rows show that the performance of the InfoNCE loss function far exceeds the others.