HittER: Hierarchical Transformers for Knowledge Graph Embeddings

This paper examines the challenging problem of learning representations of entities and relations in a complex multi-relational knowledge graph. We propose HittER, a Hierarchical Transformer model to jointly learn Entity-relation composition and Relational contextualization based on a source entity’s neighborhood. Our proposed model consists of two different Transformer blocks: the bottom block extracts features of each entity-relation pair in the local neighborhood of the source entity and the top block aggregates the relational information from outputs of the bottom block. We further design a masked entity prediction task to balance information from the relational context and the source entity itself. Experimental results show that HittER achieves new state-of-the-art results on multiple link prediction datasets. We additionally propose a simple approach to integrate HittER into BERT and demonstrate its effectiveness on two Freebase factoid question answering datasets.


Introduction
Knowledge graphs (KG) are a major form of knowledge base where knowledge is stored as graph-structured data. Because of their broad applications in various intelligent systems, including natural language understanding (Zhang et al., 2019b; Hayashi et al., 2020) and reasoning (Riedel et al., 2013; Xiong et al., 2017; Bauer et al., 2018; Verga et al., 2020), learning representations of knowledge graphs has been studied in a large body of literature.
To learn good representations of knowledge graphs, many researchers adopt the idea of mapping the entities and relations in a knowledge graph to points in a vector space. These knowledge graph embedding (KGE) methods usually leverage geometric properties of the vector space, such as translation (Bordes et al., 2013), bilinear transformations (Yang et al., 2015, DistMult), or rotation (Sun et al., 2018). Multi-layer convolutional networks have also been used for KGE (Dettmers et al., 2018, ConvE). Such KGE methods are conceptually simple and can be applied to tasks like factoid question answering (Saxena et al., 2020) and language modeling. However, because these approaches learn a single link (edge) in a knowledge graph at a time, they only exploit local connectivity patterns and ignore the vast structural information in the knowledge graph.
Meanwhile, a separate line of work tries to use graph neural network (GNN) methods to learn representations with global graph context (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017). However, these GNN methods were originally designed for simple homogeneous graphs, so they cannot handle the prevalent heterogeneous structures in knowledge graphs. To address this issue, Schlichtkrull et al. (2018) propose the relational graph convolutional network (R-GCN), which can enhance context-independent KGE methods like DistMult with contextual information. More recent work borrows the entity-relation composition operations from existing KGE methods (e.g., DistMult, ConvE) to extend the aggregation functions of several multi-relational GCN methods. But it is still unclear whether such methods capture the relational context effectively in complex multi-relational knowledge graphs with a large number of edge types, given their partially inferior performance compared to simpler context-independent KGE methods. In addition, GNN methods are known to suffer from depth limitations (Zhao and Akoglu, 2020) and efficiency issues when scaling up (Rossi et al., 2020), so they are usually restricted in expressiveness and cannot easily scale to large knowledge graphs.

Figure 1: Our model consists of two Transformer blocks organized in a hierarchical fashion. The bottom Transformer block captures the interactions within an entity-relation pair while the top one gathers information from an entity's graph neighborhood. Taking the entity embeddings E e and the relation embeddings E r as input, the output embedding T [GCLS] is used for predicting the target entity. We sometimes mask or replace E esrc with E [MASK] or E e random, in which case an additional output embedding T esrc can be used to recover the perturbed entity. The dashed box indicates a simple context-independent baseline where M esrc is directly used for link prediction.

We propose HittER, a deep hierarchical Transformer model that learns representations of entities and relations in a knowledge graph jointly by aggregating information from a graph neighborhood. Like GNN methods, our proposed model is also intended to capture the relational graph context, but it leverages the Transformer (Vaswani et al., 2017), which has proved effective and efficient at scale in various tasks (Parmar et al., 2018; Devlin et al., 2019). It has even been shown that Transformers can learn relational knowledge from large amounts of unstructured textual data (Jiang et al., 2020; Manning et al., 2020). Furthermore, there is an analogy between Transformers, which can be seen as processing complete graphs (Cai and Lam, 2020), and GNN methods that deal with more generic graphs. 1 Essentially, our proposed model consists of two different Transformer blocks where the bottom block provides relation-dependent entity embeddings for the neighborhood around the training entity and the top block aggregates information from the graph context (see Figure 1). We further design a masked entity prediction task to balance the contextual relational information and information from the training entity itself, guided by dataset-specific graph properties.

1 In Transformers, every token aggregates information from all tokens via the self-attention mechanism. This process is similar to dealing with a complete graph by GNNs.
We evaluate our proposed method using the link prediction task, which is one of the canonical tasks in statistical relational learning (SRL). Link prediction serves as a good proxy to evaluate the effectiveness of learned graph representations, by measuring the ability of a model to generalize relational knowledge stored in training graphs to unseen facts. Meanwhile, it has an important application to knowledge graph completion given the fact that most of the knowledge graphs are still highly incomplete. Our approach achieves new state-of-the-art results on two standard benchmark datasets FB15K-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018).
The remainder of this paper is organized as follows. In Section 2, we describe our proposed method for link prediction in knowledge graphs. We then show our experimental results in Section 3. In Section 4, we discuss different kinds of graph contexts and some limitations of our model. Section 5 reviews related work and Section 6 concludes the paper.

HittER
We introduce our proposed hierarchical Transformer model (Figure 1) in this section. In Section 2.1, we formally define the link prediction task in a knowledge graph, and demonstrate how to solve it by a simple Transformer scoring function. We then cover the detailed architecture of our proposed model in Section 2.2. Finally, we discuss our strategies to learn balanced contextual representations of an entity in Section 2.3.

Transformers for Link Prediction
Formally, a knowledge graph can be viewed as a set of triplets G = {(e s , r p , e o )}, each consisting of three items: the subject e s ∈ E, the predicate r p ∈ R, and the object e o ∈ E, which together describe a single fact (link) in the knowledge graph. Our model approximates a pointwise scoring function ψ : E × R × E → R which takes a triplet as input and produces a score reflecting the plausibility of the fact represented by the triplet. In the task of link prediction, given a triplet with either the subject or the object missing, the goal is to find the missing entity from the set of all entities E. Without loss of generality, we describe the case where an incomplete triplet (e s , r p ) is given and we want to predict the object e o . The subject e s can be predicted by a similar process, except that a reciprocal predicate is used to distinguish these two cases (Lacroix et al., 2018). We call the entity in the incomplete triplet the source entity e src and the entity we want to predict the target entity e tgt .
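As a concrete illustration, subject prediction can be reduced to object prediction by augmenting the triplet set with reciprocal predicates. A minimal sketch (the function name and the `_reciprocal` suffix are ours, not from the paper):

```python
def add_reciprocals(triples):
    """For each (subject, predicate, object) triple, also add a reciprocal
    triple (object, predicate^-1, subject), so that subject prediction
    becomes object prediction over the reciprocal predicate."""
    out = []
    for s, p, o in triples:
        out.append((s, p, o))
        out.append((o, p + "_reciprocal", s))
    return out
```

With this transformation, a single object-prediction model covers both directions of the link prediction task.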
Link prediction can be done in a straightforward manner with a Transformer encoder (Vaswani et al., 2017) as the scoring function, depicted inside the dashed box in Figure 1. Our inputs to the Transformer encoder are randomly initialized embeddings of the source entity e src , the predicate r p , and a special [CLS] token which serves as an additional bias term. Three different learned type embeddings are added to the three token embeddings, similar to the input representations of BERT (Devlin et al., 2019). We then use the output embedding corresponding to the [CLS] token (M esrc ) to predict the target entity, as follows. We first compute the plausibility score of the true triplet as a dot product between M esrc and the token embedding of the target entity. In the same way, we also compute the plausibility scores for all other candidate entities and normalize them using the softmax function. Lastly, we use the normalized distribution to compute the cross-entropy loss L LP = − log p(e tgt | M esrc ) for training. We will use this model as a simple context-independent baseline in later experiments. A similar approach has been explored in Wang et al. (2019).
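The scoring step above can be sketched in plain Python with the Transformer encoder abstracted away: `m_src` stands in for the [CLS] output M esrc, and entity embeddings are plain lists of floats (all names are illustrative, not from the paper's code):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def link_prediction_loss(m_src, entity_embeddings, target_idx):
    """Cross-entropy loss L_LP = -log p(e_tgt | M_esrc): the plausibility
    score of each candidate entity is the dot product between the encoder
    output m_src and that entity's embedding; scores are normalized with
    softmax and the negative log-probability of the target is returned."""
    scores = [sum(a * b for a, b in zip(m_src, e)) for e in entity_embeddings]
    probs = softmax(scores)
    return -math.log(probs[target_idx])
```

A candidate whose embedding aligns with `m_src` gets a high score and thus a low loss when it is the target.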
Learning knowledge graph embeddings from one triplet at a time ignores the abundant structural information in the graph context. Our model, as described in the following section, also considers the relational neighborhood of the source vertex (entity), which includes all of its adjacent vertices in the graph, denoted as N G (e src ) = {(e src , r i , e i )}. 2

Hierarchical Transformers
We propose a hierarchical Transformer model for knowledge graph embeddings ( Figure 1). The proposed model consists of two blocks of multi-layer bidirectional Transformer encoders.
We employ the Transformer described in Section 2.1 as our bottom Transformer block, called the entity Transformer, to learn interactions between an entity and its associated relation type. Different from the previously described context-independent scenario, this entity Transformer is now generalized to also encode information from a relational context. Specifically, there are two cases in our context-dependent scenario: 1. the source entity paired with the predicate in the incomplete triplet; 2. an entity from the graph neighborhood of the source entity paired with the relation type of the edge that connects them.
The bottom block is responsible for packing all useful features from the entity-relation composition into vector representations to be further used by the top block. The top Transformer block is called the context Transformer. Given the outputs of the entity Transformer and a special [GCLS] embedding, it contextualizes the source entity with relational information from its graph neighborhood. Similarly, three type embeddings are assigned to the special [GCLS] token embedding, the intermediate source entity embedding, and the other intermediate neighbor entity embeddings. The cross-entropy loss for link prediction now conditions on the [GCLS] output instead: L LP = − log p(e tgt | T [GCLS] ).
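A toy sketch of the two-block hierarchy, with both Transformer blocks replaced by trivial placeholders (elementwise addition and mean pooling stand in for self-attention; every name here is ours, for illustration only):

```python
def entity_block(entity_vec, relation_vec):
    # Placeholder for the bottom (entity) Transformer: combine an entity
    # embedding with its relation embedding into one pair representation.
    return [e + r for e, r in zip(entity_vec, relation_vec)]

def context_block(source_pair_vec, neighbor_pair_vecs):
    # Placeholder for the top (context) Transformer: the [GCLS]-style
    # output aggregates the source pair with its encoded neighborhood
    # (a mean here; self-attention in the actual model).
    vecs = [source_pair_vec] + neighbor_pair_vecs
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(source_pair_vec))]

def hitter_forward(src_vec, pred_vec, neighborhood):
    """neighborhood: list of (entity_vec, relation_vec) pairs adjacent to
    the source entity. Returns the aggregated vector used to score
    target entities."""
    src_pair = entity_block(src_vec, pred_vec)
    nb_pairs = [entity_block(e, r) for e, r in neighborhood]
    return context_block(src_pair, nb_pairs)
```

The point of the sketch is the data flow, not the arithmetic: each neighbor is first composed with its relation by the bottom block, and only then aggregated by the top block.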
The top block does most of the heavy lifting to aggregate contextual information together with the information from the source entity and the predicate, by using structural features extracted from the output vector representations of the bottom block.

Balanced Contextualization
Naively supplying contextual information to the model during learning can cause problems. On one hand, since the source entity often already contains much of the information needed for link prediction, the model may learn to ignore the additional contextual information, which could also be noisy. On the other hand, the introduction of rich contextual information could in turn crowd out information from the source entity and cause potential over-fitting problems. Inspired by the successful Masked Language Modeling pre-training task in BERT, we propose a two-step Masked Entity Prediction (MEP) task to balance the process of contextualization during learning.
To address the first problem, we apply a masking strategy to the source entity of each training example as follows. During training, we randomly select a proportion of training examples in a batch. With certain probabilities, we replace the input source entity with a special mask token [MASK], replace it with a randomly chosen entity, or leave it unchanged. The purpose of these perturbations is to introduce extra noise into the information from the source entity, thus forcing the model to learn contextual representations. The probability of each category is a dataset-specific hyper-parameter: for example, we can mask out the source entity more frequently if its graph neighborhood is denser (in which case, the source entity can more easily be replaced by the additional contextual information).
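The perturbation step can be sketched as follows; the default probability values and the sentinel [MASK] id are illustrative placeholders, since the actual probabilities are dataset-specific hyper-parameters:

```python
import random

MASK = -1  # sentinel id standing in for the [MASK] token

def perturb_source(entity_id, num_entities, p_mask=0.8, p_random=0.1, rng=random):
    """With probability p_mask replace the source entity with [MASK],
    with probability p_random replace it with a randomly chosen entity,
    and otherwise leave it unchanged."""
    u = rng.random()
    if u < p_mask:
        return MASK
    if u < p_mask + p_random:
        return rng.randrange(num_entities)
    return entity_id
```

Setting `p_mask` higher for a dense graph like FB15K-237 forces the model to lean on the neighborhood more often.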
To address the second problem, we want to promote the model's awareness of the masked entity. Thus we train the model to recover the perturbed source entity based on the additional contextual information. To do this, we use the output embedding corresponding to the source entity, T esrc , to predict the correct source entity via a classification layer. 3 We add the resulting cross-entropy classification loss, L MEP = − log p(e src | T esrc ), to the previously mentioned link prediction loss L LP as an auxiliary term.
This step is important when relying solely on contextual clues is insufficient for link prediction, which means the information from the source entity needs to be emphasized. It is unnecessary when there is ample contextual information. Thus we use dataset-specific configurations to strike a balance between these two sides. However, the first step of entity masking is always beneficial to the utilization of contextual information according to our observations. In addition to the MEP task, we implement a uniform neighborhood sampling strategy where only a fraction of the entities in the graph neighborhood appear in a training example. This sampling strategy acts as a data augmenter, similar to the edge dropout regularization in graph neural network methods (Rong et al., 2020). We also remove the ground-truth target entity from the source entity's neighborhood during training. Otherwise, a dramatic train-test mismatch arises: the ground-truth target entity can always be found in the source entity's neighborhood during training but can rarely be found during testing, so the model would learn to naively select an entity from the neighborhood.
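The sampling and target-removal steps can be sketched as follows (the function name and argument layout are ours):

```python
import random

def sample_neighborhood(neighbors, target_entity, max_size, keep_frac, rng=random):
    """Uniformly sample a fraction of the source entity's neighbors for one
    training example, after removing any edge that leads to the ground-truth
    target entity (avoiding the train-test mismatch described above).
    `neighbors` is a list of (relation, entity) pairs."""
    # Drop edges whose endpoint is the answer we are trying to predict.
    filtered = [(r, e) for r, e in neighbors if e != target_entity]
    # Cap the neighborhood at a fixed maximum size.
    filtered = filtered[:max_size]
    # Keep a uniform random fraction of the remaining neighbors.
    k = max(1, int(len(filtered) * keep_frac)) if filtered else 0
    return rng.sample(filtered, k)
```

At inference time the full (capped) neighborhood would be used instead, since no target needs to be hidden.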

Experiments
We describe our experiments in this section. Section 3.1 introduces two popular benchmarks for link prediction. We then describe our evaluation protocol in Section 3.2 and the detailed experimental setup in Section 3.3. Finally, our proposed method is assessed both quantitatively and qualitatively in Section 3.4, and several ablation studies are conducted in Section 3.5.

Datasets
We evaluate our proposed method on two standard benchmark datasets, FB15K-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018), whose statistics are summarized in Table 2. Notably, WN18RR is much sparser than FB15K-237, which implies it has less structural information in the local neighborhood of an entity. This consequently affects our configurations of the masked entity prediction task.

Evaluation Protocol
The task of link prediction in a knowledge graph is defined as an entity ranking task. Essentially, for each test triplet, we remove the subject or the object from it and let the model predict which is the most plausible answer among all possible entities. After scoring all entity candidates and sorting them by the computed scores, the rank of the ground truth target entity is used to further compute various ranking metrics such as mean reciprocal rank (MRR) and hits@k, k ∈ {1, 3, 10}.
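Given the 1-based ranks of the ground-truth entities across all test triplets, these metrics can be computed as, for example:

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MRR and hits@k from the 1-based ranks of the ground-truth
    target entities. MRR averages 1/rank; hits@k is the fraction of test
    triplets whose target is ranked within the top k."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mrr, hits
```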
We report all of these ranking metrics under the filtered setting proposed in Bordes et al. (2013), where valid entities other than the ground-truth target entity are filtered out from the ranked list. 4 We intentionally omit the original FB15K and WN18 datasets because of their known flaw of test leakage (Toutanova and Chen, 2015).
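A sketch of the filtered rank computation: candidates that form other known valid triplets are skipped when counting how many entities outscore the target (names are illustrative):

```python
def filtered_rank(scores, target, known_valid):
    """Filtered rank of the target entity: entities in `known_valid`
    (i.e., entities that also complete a valid triplet in train/valid/test)
    are ignored, so they cannot unfairly push the target down the list.
    `scores` maps entity index -> plausibility score."""
    target_score = scores[target]
    rank = 1
    for e, s in enumerate(scores):
        if e == target or e in known_valid:
            continue  # skip the target itself and other valid answers
        if s > target_score:
            rank += 1
    return rank
```

Without the filter, a high-scoring alternative valid answer would inflate the target's rank even though the model was arguably correct.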

Experimental Setup
We implement our proposed method in PyTorch (Paszke et al., 2019) under the LibKGE framework (Ruffinelli et al., 2020). To perform a fair comparison with some early baseline methods, we reproduce the baseline results using the best hyper-parameter configurations for them from LibKGE. 5 Our model consists of a three-layer entity Transformer and a six-layer context Transformer. Each Transformer layer has eight heads. The dimension of hidden states is 320 across all layers, except that we use 1280 dimensions for the position-wise feed-forward networks inside Transformer layers, as suggested by Vaswani et al. (2017). We set the maximum number of uniformly sampled neighbor entities for every example in the FB15K-237 and WN18RR datasets to 50 and 12 respectively. These configurations ensure that most examples (more than 85% of the cases in both datasets) have access to their entire local neighborhood during inference. During training, we further uniformly sample 70% and 50% of entities from these fixed-size sets in the FB15K-237 and WN18RR datasets respectively. We train our models using Adam (Kingma and Ba, 2015) with a learning rate of 0.01 and an L2 weight decay rate of 0.1. The learning rate increases linearly from 0 over the first tenth of the training steps, and decreases linearly through the rest of the steps. We apply dropout (Srivastava et al., 2014) with a probability p = 0.1 for all layers, except that p = 0.6 for the embedding layers. We apply label smoothing with a rate of 0.1 to prevent the model from becoming over-confident during training. We train our models with a batch size of 512 for at most 500 epochs and employ early stopping based on MRR on the validation set. When training our model with the masked entity prediction task, the masking probabilities and the use of the auxiliary loss are configured per dataset; the auxiliary loss is omitted when contextual information alone suffices (see Section 2.3). Table 1 shows the results of HittER compared with baseline methods, including some early methods and previous SOTA methods.
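The warmup-then-decay learning rate schedule described above can be sketched as follows (assuming, for illustration, that warmup spans exactly the first tenth of the total steps):

```python
def lr_at_step(step, total_steps, peak_lr=0.01):
    """Linear warmup from 0 over the first tenth of training, then linear
    decay back to 0 over the remaining steps."""
    warmup = total_steps // 10
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total_steps - step) / (total_steps - warmup)
```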
We outperform all previous work by a substantial margin across nearly all metrics. 6 Compared to previous methods that target observed patterns of specific datasets, our proposed method is more general and gives more consistent improvements across the two standard datasets. For instance, RotH, the previous SOTA on WN18RR, explicitly captures hierarchical and logical patterns with hyperbolic embeddings. Compared to it, our model performs better especially on the FB15K-237 dataset, which has a diverse set of relation types. On the other hand, our models have numbers of parameters comparable to the baseline methods, since entity embeddings contribute the majority of the parameters. Table 3 lists entity clustering results for the first few entities in each dataset, based on our learned entity representations. Clusters in FB15K-237 usually consist of entities of the same type, such as South/Central American countries, government systems, and American voice actresses, while clusters in WN18RR are generally looser but still relevant to the topic of the central word.

Ablation Studies
To determine the contribution of each aspect of our proposed method, we perform several ablation studies in this section. Table 4 shows the results of removing the masked entity prediction task described in Section 2.3 (i.e., "No MEP") and of entirely removing the context Transformer from the full model (leaving the context-independent Transformer for link prediction described in Section 2.1, i.e., "No context"). On FB15K-237, we find that the "No context" model is already very strong, which demonstrates our entity Transformer's capability of capturing interactions between entities and their associated relations. Adding contextual information further improves our model, while our proposed entity masking strategy plays a very important role on both datasets. Breaking down the model's performance by relation type on WN18RR, Table 5 shows that incorporating contextual information brings substantial improvements on two major relation types, namely the hypernym and member meronym relations, both of which include many examples belonging to the challenging one-to-many relation categories defined in Bordes et al. (2013).
Inferring the relationship between two entities can be viewed as a process of aggregating information from the graph paths between them (Teru et al., 2020). We therefore group examples in the development set of WN18RR by the number of hops (i.e., the shortest path length in the undirected training graph) between the subject and the object in each example (Figure 2). From the results, we can see that the MRR of each group decreases as the number of hops increases. This matches our intuition that aggregating information from longer graph paths is generally harder, and that such information is less likely to be meaningful. Comparing models with and without contextual information, the contextual model performs much better in the multi-hop groups ranging from two to four hops. The improvement also shrinks as the number of hops increases.

Right Context for Link Prediction
Structural information in knowledge graphs can take multiple forms, such as graph paths, sub-graphs, and the local neighborhood that we use in this work. In addition, these context forms can be represented in terms of the relation type, the entity, or both.
In this work, we show that a simple local neighborhood is sufficient to greatly improve a link prediction model. In early experiments on the FB15K-237 dataset, we observed that masking out the source entity all the time does not harm model performance much. This shows that the contextual information in a dense knowledge graph dataset like FB15K-237 is meaningful enough to replace the source entity itself in the link prediction task.
Recently, Wang et al. (2020b) argue that graph paths and the local neighborhood should be jointly considered when only relation types are used (throwing out entities). Although some recent work has made a first step towards utilizing graph paths for knowledge graph embeddings (Wang et al., 2019, 2020a), there is still no clear evidence of its effectiveness.

Limitations of the 1vsAll Scoring
Recall that HittER learns a representation for an incomplete triplet (e s , r p ) and then computes the dot product between it and all candidate target entity embeddings. This two-way scoring paradigm, often termed 1vsAll scoring, supports fast training and inference when the interactions between triplet elements are captured by computation-intensive functions (i.e., Transformers in our case), but unfortunately loses three-way interactions. We intentionally choose 1vsAll scoring for two reasons. On the one hand, 1vsAll together with cross-entropy training has empirically shown consistent improvements over alternative training configurations (Ruffinelli et al., 2020). On the other hand, it ensures reasonable speed at inference time, which inevitably requires scoring all candidate entities.
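A minimal sketch of the 1vsAll paradigm: the incomplete triplet is encoded once into `query_vec`, and every candidate entity is then scored with a cheap dot product (names are ours, for illustration):

```python
def one_vs_all_scores(query_vec, entity_matrix):
    """1vsAll scoring: the expensive encoder runs once to produce
    query_vec for the incomplete triplet (e_s, r_p); each row of
    entity_matrix is a candidate entity embedding, scored by a dot
    product. Three-way interactions are lost, but scoring stays linear
    in the number of entities."""
    return [sum(q * e for q, e in zip(query_vec, row)) for row in entity_matrix]
```

The alternative, re-running the encoder once per (query, candidate) pair, would capture three-way interactions but multiply inference cost by the number of entities.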
Admittedly, early interactions between the source entity and the target entity can provide valuable information to inform the representation learning of the incomplete triplet (e s , r p ). For instance, we find that a simple bilinear formulation of the source entity embeddings and the target entity embeddings can reflect the distance (the number of hops) between them. We leave the question of how to effectively incorporate such early fusion for future work.

Related Work
Link prediction using knowledge graph embeddings has been extensively studied in several diverse directions.

Triple-based Methods
Most previous work focuses on exploiting explicit geometric properties in the embedding space to capture different relations between entities. Early work uses translational distance-based scoring functions defined on top of entity and relation embeddings (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Ji et al., 2015). Another line of work uses tensor factorization methods to match entities semantically. Starting from simple bilinear transformations in Euclidean space (Nickel et al., 2011), numerous more complicated transformations in various spaces have since been proposed (Trouillon et al., 2016; Ebisu and Ichise, 2018; Sun et al., 2018; Zhang et al., 2019a; Chami et al., 2020; Tang et al., 2020). Such methods effectively capture intuitions drawn from observations of the data, but they struggle with unobserved geometric properties and are generally limited in expressiveness.
In light of recent advances in deep learning, more powerful neural network modules such as Convolutional Neural Networks (Dettmers et al., 2018) and Capsule Networks (Nguyen et al., 2019) have also been introduced to capture the interaction between entity embeddings and relation embeddings. Similar to our entity Transformer block, Wang et al. (2019) use the Transformer to contextualize an entity embedding with relation embeddings. These methods produce richer representations and better performance on predicting missing links in KGs. However, they learn from pairwise local connectivity patterns in KGs and ignore the structural information stored in the graph context.

Context-aware Methods
Various forms of graph context have been proven effective in recent work on neural networks operating on graphs under the message passing framework (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017). Schlichtkrull et al. (2018, R-GCN) adapt graph convolutional networks to realistic knowledge graphs, which are characterized by their highly multi-relational nature. Teru et al. (2020) incorporate an edge attention mechanism into R-GCN, showing that the relational path between two entities in a knowledge graph contains valuable information about their relations in an inductive learning setting. Subsequent work explores using existing knowledge graph embedding methods to improve the entity-relation composition in various GCN methods. Bansal et al. (2019) borrow the idea from Graph Attention Networks (Veličković et al., 2018), using a bilinear attention mechanism to selectively gather useful information from neighbor entities. Different from their simpler attention formulation, we use the Transformer to capture both entity-relation and entity-context interactions. Nathani et al. (2019) also propose an attention-based feature embedding to capture multi-hop neighbor information, but unfortunately, their reported results have been shown to be unreliable in a recent re-evaluation.

Conclusion and Future Work
In this work, we proposed HittER, a novel Transformer-based model with tailored training strategies for learning knowledge graph embeddings in complex multi-relational graphs. We showed that with contextual information from a local neighborhood, our proposed model outperforms all previous approaches on the link prediction task, achieving new state-of-the-art results on two standard datasets, FB15K-237 and WN18RR.
Future work could apply our proposed method to other graph representation learning tasks besides link prediction. Currently, our proposed hierarchical Transformers are only capable of aggregating information from a local neighborhood; another direction is to extend them with a broader graph context.
Since the Transformer has become the state-of-the-art model in language model pre-training (Devlin et al., 2019) and many other natural language processing tasks, one promising direction is to merge our proposed model with other Transformer-based models to perform text-graph joint reasoning in tasks that involve both text and graph data.