Incorporating Global Information in Local Attention for Knowledge Representation Learning

Graph Attention Networks (GATs) have proven to be a promising model that takes advantage of a localized attention mechanism to perform knowledge representation learning (KRL) on graph-structured data, e.g., Knowledge Graphs (KGs). While such approaches model entities' local pairwise importance, they lack the capability to model global importance relative to the other entities of a KG. This causes such models to miss critical information in tasks where global information is a significant component, such as knowledge representation learning. To address this issue, we allow the proper incorporation of global information into the GAT family of models through the use of scaled entity importance, which is calculated by an attention-based global random walk algorithm. In the context of KRL, incorporating global information boosts performance significantly. Experimental results on KG entity prediction against state-of-the-art methods sufficiently demonstrate the effectiveness of our proposed model.


Introduction
Graph Attention Networks (GATs) have been successfully applied to various tasks over graphs (Velickovic et al., 2018; Lee et al., 2018b), such as graph classification (Wu et al., 2019b; Lee et al., 2018a), link prediction (Abu-El-Haija et al., 2018), and node classification (Lee et al., 2019; Zhang et al., 2020a). GATs learn from the underlying graph structure by making use of a localized attention mechanism (Wu et al., 2019a; Xu et al., 2019; Vashishth et al., 2020b), where the hidden representation of each node is computed by recursively aggregating and attending over its local neighbors' features, and the weighting coefficients are calculated inductively with a self-attention strategy (Thekumparampil et al., 2018; Qian et al., 2018; Zhang et al., 2018). The original GATs operate only on single-relational homogeneous graphs (Velickovic et al., 2018; Wang et al., 2019b). Recent advances extend them to the more general and prevalent multi-relational graphs (Wang et al., 2019b; Hong et al., 2020; Nathani et al., 2019; Zhang et al., 2020c), such as the representative Knowledge Graphs (KGs), which contain multiple types of entities (nodes) and relationships (edges) (Zhou et al., 2018; Han et al., 2018; Wang et al., 2019a). However, these approaches can only exploit localized features within the neighborhood of individual entities (Nathani et al., 2019; Busbridge et al., 2019; Zhang et al., 2020c). For some tasks, such simplified localized feature aggregation may be sufficient, but it is insufficient for knowledge representation learning (KRL) tasks that also require exploring global information.
In this paper, we concentrate on how to incorporate global information in local attention for knowledge representation learning. Specifically, we allow the proper incorporation of global information into the GAT family of models through the use of scaled entity importance, which is estimated by a global random walk algorithm over the whole graph's structural information. In KGs, entity importance indicates the global significance or authority of an entity. Intuitively, it can be quite beneficial if an entity attends more to its "authoritative" neighbors that have high global entity importance scores. For instance, the movie "Titanic" links to different actors, among which a superstar (e.g., "Leonardo DiCaprio") may be more indicative than other actors.

[Figure 1 caption, right panel: our proposed model (EIGAT), which incorporates global information in local attention through the use of relative entity importance (REI). $REI_{e_1}(e_i)$ is calculated by an attention-based global random walk algorithm over the whole graph. GAT parameterizes the edge weights based on local attention scores ($\alpha_{1i}$, also represented by the distinct edge widths). Our EIGAT adds the relative importance score (represented by the different scaling of nodes), which is derived from global structural information. Although relationships should be drawn in the knowledge graph, for clarity, we intentionally omit them here; this does not hurt the presentation of the basic idea of our model.]

In this paper, we propose a novel Entity Importance-aware Graph ATtention Network, EIGAT, which incorporates global entity importance in the local attention mechanism for learning effective knowledge representations. As shown in Figure 1, we give a brief illustration of our proposed EIGAT, compared to the earlier GCN (Kipf and Welling, 2017) and GAT (Velickovic et al., 2018). In EIGAT, the importance scores of all entities are estimated from global information and incorporated in local entity aggregation (Equation 5) to build better entity embeddings. In particular, we provide an attention-based random walk approach to estimate entity importance from global structural information for EIGAT. We conduct extensive experiments on several different types of KGs via entity prediction against state-of-the-art methods, which sufficiently demonstrate that our proposed EIGAT can successfully incorporate global information in local attention to improve knowledge representation learning.
The contributions of this paper are threefold:
• We propose to incorporate global information in local attention for knowledge representation learning.
• We propose EIGAT, a novel entity importance-aware graph attention network which incorporates global entity importance into local entity aggregation.
• Extensive experimental results demonstrate the efficacy of our proposed model on link prediction.

Related Work
To make this paper self-contained, we briefly review related work on knowledge representation learning and Graph Neural Networks (GNNs).

Graph Attention Networks (GATs)
Graph Neural Networks (GNNs) develop deep neural networks to deal with arbitrary graphs for representation learning (Scarselli et al., 2008; Zhou et al., 2019; Hou et al., 2020). Graph Convolutional Networks (GCNs) are one of the most prominent advances among them (Schlichtkrull et al., 2018; Wu et al., 2019a; Xu et al., 2019; Vashishth et al., 2020b); they generalize the local convolution operation to graph-structured data, i.e., each node gathers information from its one-hop neighbors, and all neighbors contribute equally in the message passing. Inspired by the success of the attention mechanism in NLP and CV, Velickovic et al. (2018) proposed Graph Attention Networks (GATs) by incorporating a local attention mechanism (Vaswani et al., 2017; Qian et al., 2018; Lu and Li, 2020) into GCNs, which compute the hidden state of each node by attending over its neighbors (Thekumparampil et al., 2018; Lee et al., 2018b). Recently, several advanced extensions of GATs were proposed to operate on knowledge graphs. Han et al. (2018) proposed to jointly apply attention to KGs and external text data. Busbridge et al. (2019) proposed RGAT by extending non-relational GATs to incorporate relational information, but with limited performance. Nathani et al. (2019) proposed a triple-level attention model that captures the integrated features of both entities and relations in a given entity's neighborhood, and Zhang et al. (2020c) proposed a two-level hierarchical attention mechanism. These studies are related to our work in the sense that we all use GNNs to capture more structural information in KGs. However, all of them ignore global information in local attention computation.
Most recently, Xu et al. (2020) proposed a Transformer-based model that enhances the copy mechanism for abstractive summarization by considering the global importance of each source word based on its degree centrality in the Transformer, which inspires our idea of incorporating global information in local attention for KRL. Table 1 summarizes the key concepts and other settings of the GNNs discussed above.

Methodology
In this section, we introduce the details of the proposed EIGAT model that incorporates global information in local attention for knowledge representation learning on KGs. We start by describing a single entity importance-aware graph attention layer, which is the building block of our model's overall architecture. Before that, we briefly introduce the notations of this paper.
Notations. In a graph attention network with $L$ layers, the input to the $\ell$-th layer ($\ell = 1, \dots, L$) consists of two embedding sets: (1) the output entity embeddings from the $(\ell-1)$-th layer, represented by a matrix $E^{\ell-1} \in \mathbb{R}^{\eta_{\ell-1} \times N_e}$, where $N_e$ is the number of entities and $\eta_{\ell-1}$ is the dimension of the output entity embeddings in the $(\ell-1)$-th layer; (2) the output relationship embeddings from the $(\ell-1)$-th layer, denoted by a matrix $R^{\ell-1} \in \mathbb{R}^{\zeta_{\ell-1} \times N_r}$, where $N_r$ and $\zeta_{\ell-1}$ represent the number of relationships and the output relationship feature dimension in the $(\ell-1)$-th layer, respectively. The $\ell$-th layer then produces the corresponding new output embedding matrices (of potentially different dimensionality), $E^{\ell} \in \mathbb{R}^{\eta_{\ell} \times N_e}$ and $R^{\ell} \in \mathbb{R}^{\zeta_{\ell} \times N_r}$. In the following, we describe the $\ell$-th graph attention layer.

Local Attention Evaluation
A triple $t^k_{ij} = (e_i, r_k, e_j)$ indicates a relationship $r_k$ between head entity $e_i$ and tail entity $e_j$. Following (Nathani et al., 2019), the representation $v^{\ell}_{ikj}$ of the triple $t^k_{ij}$ is built as follows:

$$v^{\ell}_{ikj} = W^{\ell}_1 \left[ e^{\ell-1}_i \,\Vert\, r^{\ell-1}_k \,\Vert\, e^{\ell-1}_j \right], \qquad (1)$$

where $W^{\ell}_1$ denotes a linear transformation matrix in the $\ell$-th layer; $e^{\ell-1}_i$, $r^{\ell-1}_k$ and $e^{\ell-1}_j$ denote the output embeddings of $e_i$, $r_k$ and $e_j$ in the $(\ell-1)$-th layer, respectively; and $\Vert$ represents concatenation. We then calculate the absolute relation attention value

$$b^{\ell}_{ikj} = \mathrm{LeakyReLU}\left( W^{\ell}_2 \, v^{\ell}_{ikj} \right), \qquad (2)$$

where $W^{\ell}_2$ and LeakyReLU are a linear weight vector in the $\ell$-th layer and a non-linear activation function, respectively, which act upon the embedding $v^{\ell}_{ikj}$ in turn. We then utilize softmax to evaluate the relative relation attention value $\alpha^{\ell}_{ikj}$ of the triple $t^k_{ij}$ in the $\ell$-th layer:

$$\alpha^{\ell}_{ikj} = \mathrm{softmax}\left(b^{\ell}_{ikj}\right) = \frac{\exp\left(b^{\ell}_{ikj}\right)}{\sum_{e_n \in \mathrm{In}(e_j)} \sum_{r \in R_{nj}} \exp\left(b^{\ell}_{nrj}\right)}, \qquad (3)$$

where $\mathrm{In}(e_j)$ denotes the neighbors pointing to the targeted tail entity $e_j$, and $R_{nj}$ denotes the set of relationships between $e_n$ and $e_j$.
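To make the layer concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3). It is an illustration under our own naming and tensor-shape assumptions (the paper does not show an implementation); `triples` is assumed to enumerate every edge so the softmax of Eq. (3) can be grouped by tail entity.

```python
import torch
import torch.nn.functional as F

def local_attention(E_prev, R_prev, triples, W1, W2, slope=0.2):
    """E_prev: (N_e, eta) entity embeddings from layer l-1.
    R_prev: (N_r, zeta) relation embeddings from layer l-1.
    triples: (T, 3) LongTensor of (head, relation, tail) indices.
    W1: (eta_out, 2*eta + zeta) transformation matrix (Eq. 1).
    W2: (eta_out,) weight vector (Eq. 2)."""
    h, r, t = triples[:, 0], triples[:, 1], triples[:, 2]
    # Eq. (1): v_ikj = W1 [e_i || r_k || e_j]
    v = torch.cat([E_prev[h], R_prev[r], E_prev[t]], dim=-1) @ W1.T
    # Eq. (2): absolute attention b_ikj = LeakyReLU(W2 v_ikj)
    b = F.leaky_relu(v @ W2, negative_slope=slope)
    # Eq. (3): softmax over all triples sharing the same tail entity e_j
    b_exp = torch.exp(b - b.max())  # subtract the max for stability
    denom = torch.zeros(E_prev.size(0)).index_add_(0, t, b_exp)
    alpha = b_exp / denom[t]
    return v, alpha
```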

Global Entity Importance Estimation
To obtain the global entity importance $EI(e_i)$ of an entity $e_i$, we formally introduce a relation attention-based global random walk method, as follows:

$$EI(e_i)^t = (1 - d) + d \sum_{e_m \in \mathrm{In}(e_i)} \frac{b_{mri}}{\sum_{e_n \in \mathrm{Out}(e_m)} b_{mrn}} \, EI(e_m)^{t-1}, \qquad (4)$$

where $d$ is a hyperparameter denoting the probability that an imaginary surfer randomly moves to a neighboring entity, and $(1-d)$ denotes the probability of teleporting to any other entity at random, which is able to alleviate the information-island problem caused by isolated entities that lack any in-degree or out-degree neighbors (e.g., #median in-degree = 0 in NELL-995 in Table 2). $\mathrm{Out}(e_m)$ denotes the neighbors that an entity $e_m$ points to, and $EI(e_m)^{t-1}$ denotes the EI score of the entity $e_m$ in the $(t-1)$-th iteration. The random walk distance $t$ depends on both the number of attention layers $L$ and the number of training epochs $C$, with $t \in (1, L \times C]$. The relation weights (e.g., $b_{mri}$) are calculated by Equation (2). Unlike conventional fixed-weight random walk methods (Mihalcea and Tarau, 2004; Florescu and Caragea, 2017), a novelty here is that the dynamic relation weights (e.g., $b_{mri}$) are iteratively and automatically optimized during training by the graph attention mechanism. In line with the theoretical desiderata for modeling node importance in multi-relational graphs (MRGs), this method exhibits the following essential characteristics: (i) Neighborhood-awareness, i.e., neighboring EI scores are taken into account when a given entity's importance score is modeled. (ii) Relationship-awareness, i.e., different relationships can play different roles in propagating EI scores. (iii) Centrality-awareness, i.e., more central nodes are inherently and reasonably more important than less central nodes. (iv) Universality and flexibility, i.e., it utilizes only the graph's global structural information. A sketch of one random-walk iteration is shown below.
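The following is a minimal NumPy sketch of one iteration of Eq. (4). It assumes the weighted-PageRank formulation above; the edge-list format and function name are our own illustrative choices.

```python
import numpy as np

def entity_importance_step(edges, weights, ei, d=0.85):
    """One random-walk iteration of Eq. (4).
    edges: (T, 2) array of (head, tail) entity indices.
    weights: (T,) relation attention weights b_mri from Eq. (2).
    ei: (N_e,) EI scores from the previous iteration."""
    n = ei.shape[0]
    # Normalize each edge weight by the head's total outgoing weight.
    out_sum = np.zeros(n)
    np.add.at(out_sum, edges[:, 0], weights)
    trans = weights / np.maximum(out_sum[edges[:, 0]], 1e-12)
    # Teleport term (1 - d) plus weighted propagation from in-neighbors.
    new_ei = np.full(n, 1.0 - d)
    np.add.at(new_ei, edges[:, 1], d * trans * ei[edges[:, 0]])
    return new_ei
```

EI scores are initialized randomly in $(0, 1)$ (see the experimental settings), and one such step is taken per attention layer and training epoch, yielding the walk distance $t \in (1, L \times C]$.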

Incorporate Global Information in Local Attention
Though the attention mechanism can assign different importance to nodes via learned weights, it is still a local computation. The attention value, e.g., $\alpha^{\ell}_{ikj}$ in Equation (3), is a function of pairwise feature interactions within a local neighborhood and does not take account of entity importance derived from the global graph structure. To this end, we incorporate global information in the local attention computation, as shown in Figure 1 (EIGAT). Specifically, to generate the output embedding $e^{\ell}_j$ of tail entity $e_j$ in the $\ell$-th layer, we incorporate the global relative head entity importance $REI_{e_j}(e_i)$ in local attention to conduct entity aggregation over the associated triple representations $v^{\ell}_{ikj}$ weighted by their relative attention values $\alpha^{\ell}_{ikj}$, as follows:

$$e^{\ell}_j = \sigma\left( \sum_{e_i \in \mathrm{In}(e_j)} \sum_{r_k \in R_{ij}} REI_{e_j}(e_i) \, \alpha^{\ell}_{ikj} \, v^{\ell}_{ikj} \right). \qquad (5)$$

In Eq. (5), we bring in the global relative entity importance $REI_{e_j}(e_i)$ of the different head entities in $\mathrm{In}(e_j)$ to learn more from those significant neighboring entities, and thus obtain better knowledge representations for the targeted tail entity $e_j$.
To stabilize the learning process of self-attention, as suggested by (Velickovic et al., 2018), we employ multi-head attention. Specifically, $M$ independent attention mechanisms execute the transformation of Eq. (5), and then their features are concatenated:

$$e^{\ell}_j = \Big\Vert_{m=1}^{M} \sigma\left( \sum_{e_i \in \mathrm{In}(e_j)} \sum_{r_k \in R_{ij}} REI_{e_j}(e_i) \, \alpha^{\ell,m}_{ikj} \, v^{\ell,m}_{ikj} \right). \qquad (7)$$

We also conduct a linear transformation on the input relationship embedding $r^{\ell-1}_k \in \mathbb{R}^{\zeta_{\ell-1}}$ in the $\ell$-th layer:

$$r^{\ell}_k = W^{\ell,R} \, r^{\ell-1}_k,$$

where $W^{\ell,R} \in \mathbb{R}^{\zeta_{\ell} \times \zeta_{\ell-1}}$ is a weight matrix, $\zeta_{\ell-1}$ and $\zeta_{\ell}$ are the dimensions of the input and output relationship embeddings, respectively, and $r^{\ell}_k \in \mathbb{R}^{\zeta_{\ell}}$ represents the output relationship embedding in the $\ell$-th layer.
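A single-head sketch of the REI-weighted aggregation of Eq. (5), reusing the outputs of the local-attention sketch above. The exact scaling that turns global EI scores into relative importance REI is our assumption (the paper describes it only as scaled entity importance), as is the choice of non-linearity. For $M$ heads, this transformation is run $M$ times and the outputs concatenated, as in Eq. (7).

```python
import torch

def aggregate_with_rei(v, alpha, triples, ei, num_entities):
    """v: (T, eta) triple representations from Eq. (1).
    alpha: (T,) relative attention values from Eq. (3).
    ei: (N_e,) global entity importance scores from Eq. (4)."""
    h, t = triples[:, 0], triples[:, 2]
    # Scale global EI into (0, 1]; the exact REI scaling is assumed here.
    rei = ei[h] / ei.max().clamp(min=1e-12)
    # Eq. (5): sum REI * alpha * v over each tail entity's in-neighborhood.
    out = torch.zeros(num_entities, v.size(1))
    out.index_add_(0, t, (rei * alpha).unsqueeze(-1) * v)
    return torch.sigmoid(out)  # placeholder non-linearity sigma
```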

Model Architecture
Our model follows an encoder-decoder framework: (i) the encoder model includes L attention layers, (ii) the decoder model provides a scoring function (Eq. 11) to calculate the likelihood of given triples being valid. Based on it, the KG incompleteness issue is expected to be alleviated by link prediction (Section 5), i.e., inferring possible missing relations, e.g. (e i , r k , ?) or (?, r k , e j ).

Encoder
Based on the single attention layer introduced above, we build the overall architecture of our encoder model with $L$ layers. In practice, we set $L = 2$ for our encoder model. In the final ($L$-th) layer, instead of concatenation (Equation 7), we employ averaging and delay applying the final non-linear activation:

$$e^{L}_j = \sigma\left( \frac{1}{M} \sum_{m=1}^{M} \sum_{e_i \in \mathrm{In}(e_j)} \sum_{r_k \in R_{ij}} REI_{e_j}(e_i) \, \alpha^{L,m}_{ikj} \, v^{L,m}_{ikj} \right).$$

To retain initial entity information in the final embedding, we obtain the final entity embedding $e \in \mathbb{R}^{\eta_L}$ by combining the transformed initial embedding $e^{0} \in \mathbb{R}^{\eta_0}$ and the output entity embedding $e^{L} \in \mathbb{R}^{\eta_L}$ of the $L$-th layer, as follows:

$$e = W e^{0} + e^{L},$$

where $W \in \mathbb{R}^{\eta_L \times \eta_0}$ is a projection matrix. The initial entity embeddings (i.e., $e^{0}, \forall e \in \mathcal{E}$) and relationship embeddings (i.e., $r^{0}, \forall r \in \mathcal{R}$) are pre-trained with TransE (Bordes et al., 2013).
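The final combination step amounts to a simple residual-style projection; this one-line sketch uses our own argument names.

```python
def final_entity_embedding(e0, eL, W):
    """e0: (N_e, eta_0) pretrained TransE initial embeddings.
    eL: (N_e, eta_L) output of the L-th attention layer.
    W: (eta_L, eta_0) projection matrix."""
    return e0 @ W.T + eL
```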

Decoder
Among the existing KG completion (KGC) models, we utilize the recent model ConvKB (Nguyen et al., 2018) as our decoder model. Given a triple $t^k_{ij}$, the scoring function is formally defined as:

$$f(t^k_{ij}) = \left( \Big\Vert_{\omega \in \Omega} g\left([e_i, r_k, e_j] * \omega\right) \right) \cdot W, \qquad (11)$$

where $\Omega$ denotes the set of filters, $\tau = |\Omega|$ and $\omega \in \Omega$. $\Omega$ and $W$ are shared parameters, independent of $e_i$, $r_k$ and $e_j$. $g(\cdot)$ is an activation function such as ReLU, and $*$ denotes the convolution operator. The resulting $\tau$ feature maps are concatenated into a single vector $\in \mathbb{R}^{\tau \phi}$, which is then combined with a weight vector $W \in \mathbb{R}^{\tau \phi}$ via a dot product to give a likelihood score for the triple $t^k_{ij}$. $\phi$ denotes the dimension of the entity and relation embeddings. In practice, we set $\phi = \eta_L = \zeta_L$ for ConvKB.
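For reference, a compact PyTorch sketch of the ConvKB scoring function of Eq. (11), following Nguyen et al. (2018); the class name and the use of ReLU for $g(\cdot)$ are illustrative.

```python
import torch
import torch.nn as nn

class ConvKBScorer(nn.Module):
    def __init__(self, phi, tau):
        super().__init__()
        # tau filters of size 1x3 slide over the (phi, 3) triple matrix.
        self.conv = nn.Conv2d(1, tau, kernel_size=(1, 3))
        self.w = nn.Linear(tau * phi, 1, bias=False)

    def forward(self, e_i, r_k, e_j):
        """Each argument: (B, phi). Returns (B,) likelihood scores."""
        x = torch.stack([e_i, r_k, e_j], dim=-1).unsqueeze(1)  # (B,1,phi,3)
        feat = torch.relu(self.conv(x))                        # (B,tau,phi,1)
        return self.w(feat.flatten(1)).squeeze(-1)             # Eq. (11)
```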

Optimization
We utilize a two-step training procedure for the encoder-decoder framework, which is a routine optimization strategy for such frameworks (Zhou et al., 2019). (i) We first train the encoder model to learn the embeddings of entities and relationships by minimizing a hinge-loss function:

$$\mathcal{L}_{enc} = \sum_{t^k_{ij} \in G} \sum_{t'^{k}_{ij} \in G'} \max\left\{ h_{t^k_{ij}} - h_{t'^{k}_{ij}} + \gamma, \, 0 \right\},$$

where $h_{t^k_{ij}} = \big\Vert e^{L}_i + r^{L}_k - e^{L}_j \big\Vert_1$ indicates the translational scoring function of the triple $t^k_{ij}$ (Bordes et al., 2013), and $\gamma > 0$ is a margin hyperparameter. (ii) We then train the decoder model ConvKB for link prediction by minimizing a soft-margin loss function:

$$\mathcal{L}_{dec} = \sum_{t^k_{ij} \in G \cup G'} \log\left(1 + \exp\left(l_{t^k_{ij}} \cdot f(t^k_{ij})\right)\right) + \frac{\lambda}{2} \Vert W \Vert^2_2,$$

in which $l_{t^k_{ij}} = 1$ for $t^k_{ij} \in G$ and $l_{t^k_{ij}} = -1$ for $t^k_{ij} \in G'$; $G$ and $G'$ are the sets of positive and negative triples, respectively.
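The two losses can be sketched as follows; the corrupted-triple sampling and the regularization weight λ are assumptions in the style of TransE and ConvKB, not taken from the paper.

```python
import torch

def encoder_hinge_loss(h_pos, h_neg, gamma=1.0):
    """h_pos/h_neg: (B,) translational distances ||e_i + r_k - e_j||_1
    for valid and corrupted triples; valid triples should score lower."""
    return torch.clamp(h_pos - h_neg + gamma, min=0).sum()

def decoder_soft_margin_loss(scores, labels, w, lam=1e-3):
    """scores: (B,) ConvKB scores f(t); labels: +1 for G, -1 for G'."""
    return torch.log1p(torch.exp(labels * scores)).sum() \
        + lam / 2 * w.pow(2).sum()
```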

Experiments
We evaluate the effectiveness of our proposed model EIGAT by link prediction (determined by Equation 11), which aims to infer possible missing relations, i.e., predict $e_j$ given $(e_i, r_k, ?)$ or predict $e_i$ given $(?, r_k, e_j)$.

Datasets
We use three public benchmark datasets for link prediction experiments: Kinship (Lin et al., 2018), NELL-995, and FB15K-237 (Toutanova et al., 2015). We discard another popular dataset, WN18RR, because it is too sparse to learn global information from. The basic statistics of all datasets are included in Table 2. To explore the performance of our proposed model on datasets with different global topology characteristics, we compute their density values (Coleman and Moré, 1983) and report them in Table 2. Since NELL-995 is sparser than Kinship and FB15K-237, and its median in-degree is even 0, global entity importance estimation is relatively hard on NELL-995.
Definition 1 (Graph Density). Graph density measures how sparse a graph is. Similar to (Coleman and Moré, 1983), given a graph $G$, it is formally defined as:

$$D(G) = \frac{E}{N(N-1)},$$

where $N$ denotes the number of nodes in $G$ and $E$ denotes the number of edges in $G$. The lower $D(G)$ is, the sparser the graph.
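For directed graphs this reduces to a one-liner; the normalization by $N(N-1)$ is the standard directed-graph density and is our reading of Definition 1.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """D(G) = E / (N (N - 1)) for a directed graph G."""
    return num_edges / (num_nodes * (num_nodes - 1))
```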

Baselines
To demonstrate the effectiveness of our proposed model EIGAT for link prediction, we compare it with the following state-of-the-art (SOTA) baselines:
• TransE (Bordes et al., 2013): one of the earliest and most widely used KGC models.
• DistMult: a popular tensor factorization-based KGC model which uses a bi-linear scoring function to calculate the scores of knowledge triples.
• ComplEx (Trouillon et al., 2016): an advanced extension of DistMult which encodes entities and relationships into a complex vector space instead of a real-valued one.
• R-GCN (Schlichtkrull et al., 2018): an advanced extension of GCN that can effectively model multi-relational data.
• Nathani's (Nathani et al., 2019): a recent KGC model that models the local neighborhood via graph relational attention network.
• A2N (Bansal et al., 2019): a recent model that learns query-dependent representations of entities based on a GNN structure.
• HAKE (Zhang et al., 2020b): a SOTA KGC model that models semantic hierarchies.
• InteractE (Vashishth et al., 2020a): a recent extension of ConvE that increases the interaction between relation and entity embeddings.
• ReInceptionE: a recent extension of ConvE that uses local-global structural information.
• RGHAT (Zhang et al., 2020c): a SOTA KGC model that models the local neighborhood via hierarchical attention mechanism.

Evaluation Protocol
We utilize ranking criteria for evaluation. For each test triple, we remove the head entity or the tail entity and replace it with each of the entities in $\mathcal{E}$ in turn. The scores of the corrupted triples are computed by the decoder model (Eq. 11) and sorted in descending order, from which we obtain the exact rank of the correct triple among the candidates. Similar to most baselines, we report experimental results in the "Filter" setting, i.e., removing corrupted triples that are already present in the datasets during ranking. The evaluation metrics include the mean reciprocal rank (MRR), mean rank (MR), and the proportion of correct entities ranked in the top N (Hits@N, N = 1, 3, 10).

Table 3 and Table 4 report the detailed hyperparameter settings of the encoder and decoder models of EIGAT, respectively. In training, we use $M = 2$ attention heads. The final dimensions of entity and relation embeddings are set to 200. The slope parameter $\alpha$ of LeakyReLU in Eq. (2) is set to 0.2 on all datasets. We use auxiliary relations from the 2-hop neighborhood to aggregate more information about the neighborhoods. EI scores are initialized randomly in $(0, 1)$. We use the typical value $d = 0.85$ (Mihalcea and Tarau, 2004; Florescu and Caragea, 2017).

Table 5 and Table 6 present the link prediction results (significance level of 0.05). We observe that: (i) EIGAT significantly and consistently outperforms all state-of-the-art baselines on most metrics across all benchmark datasets, which demonstrates the effectiveness of our proposed model. (ii) The advantages of EIGAT over the baselines on NELL-995 appear smaller than on the other datasets. This is because the rich global structural information in relatively dense graphs (i.e., Kinship and FB15K-237) leads to more effective entity importance estimation by the global random walk method than the limited global structural knowledge in relatively sparse graphs (i.e., NELL-995). The results show that NELL-995 is harder for EIGAT to learn from, but the comparable results also verify the effectiveness and robustness of our model in both scenarios.
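The filtered ranking protocol and the reported metrics can be sketched as follows; `score_fn` stands in for the decoder of Eq. (11), and the query format is our illustrative choice.

```python
import numpy as np

def filtered_rank(score_fn, h, r, t, num_entities, known_tails):
    """Rank the true tail t for query (h, r, ?). known_tails holds all
    tails t' with (h, r, t') present in the dataset ("Filter" setting)."""
    cand = np.array([c for c in range(num_entities)
                     if c == t or c not in known_tails])
    scores = np.array([score_fn(h, r, c) for c in cand])
    order = cand[np.argsort(-scores)]  # descending, as described above
    return int(np.where(order == t)[0][0]) + 1

def link_prediction_metrics(ranks):
    ranks = np.asarray(ranks, dtype=float)
    return {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean(),
            **{f"Hits@{n}": (ranks <= n).mean() for n in (1, 3, 10)}}
```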

Ablation Study
To analyze the contribution of global information in EIGAT, we compare EIGAT with EIGAT-Remove-global (i.e., EIGAT with global entity importance removed). The comparison results in Table 7 indicate that EIGAT achieves improvements over EIGAT-Remove-global on all metrics. In particular, on MR, EIGAT surpasses EIGAT-Remove-global by a large margin of 56. The results demonstrate that our model can successfully take global information into account in local attention to aggregate more effective entity representations.

Table 8 gives examples of entity prediction results of EIGAT on the FB15k-237 test set (predicting tail entities), illustrating the efficacy of our proposed EIGAT. Given a head entity and a relation, the top predicted tail entities (and the true one) are depicted. Even when the true fact is not ranked first, the predicted results still reflect common sense.

Conclusion and Future Work
In this paper, we propose to incorporate global information in local attention for knowledge representation learning and introduce a novel GAT-based model that incorporates global entity importance. In particular, we provide an attention-based global random walk approach to estimate entity importance. The experimental results on entity prediction demonstrate that our model can successfully take global information into account in local attention to improve knowledge representation learning. Interesting future directions include generalizing EIGAT to other relational graphs (e.g., heterogeneous information networks (HINs) and user-item graphs in recommender systems) and exploring an advanced variant of EIGAT in a semi-supervised learning scenario.