HAHE: Hierarchical Attention for Hyper-Relational Knowledge Graphs in Global and Local Level

Link Prediction on Hyper-relational Knowledge Graphs (HKG) is a worthwhile endeavor. HKG consists of hyper-relational facts (H-Facts), composed of a main triple and several auxiliary attribute-value qualifiers, which can effectively represent factually comprehensive information. The internal structure of HKG can be represented as a hypergraph-based representation globally and a semantic sequence-based representation locally. However, existing research seldom simultaneously models the graphical and sequential structure of HKGs, limiting HKGs’ representation. To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE), including global-level and local-level attention. The global-level attention can model the graphical structure of HKG using hypergraph dual-attention layers, while the local-level attention can learn the sequential structure inside H-Facts via heterogeneous self-attention layers. Experiment results indicate that HAHE achieves state-of-the-art performance in link prediction tasks on HKG standard datasets. In addition, HAHE addresses the issue of HKG multi-position prediction for the first time, increasing the applicability of the HKG link prediction task. Our code is publicly available.


Introduction
Knowledge graphs (KGs) are semantic networks that define relationships between entities. Early KG research (Bordes et al., 2013; Sun et al., 2019; Balazevic et al., 2019) uses binary relationships, often expressed as a triple-based fact (subject, relation, object). Yet n-ary relational facts (containing more than two entities) are abundant in real-world KGs like Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014). Rosso et al. (2020) represent an n-ary relational fact as a hyper-relational fact (H-Fact), composed of a main triple and several auxiliary attribute-value qualifiers. As shown in Figure 1, an H-Fact can describe a real-world fact. Unlike traditional triple-based facts, H-Facts do not just raise the number of entities in a fact from two to n; they structurally and effectively represent the n-ary relational facts prevalent in reality. Globally, they extend the ordinary graph structure to a hypergraph (Zhou et al., 2006) structure. Locally, they define five heterogeneous roles s, r, o, a, v within facts to capture the semantic information of a fact such as 'Barack Obama held position as US president', as illustrated in Figure 2.
Recent research has demonstrated various embedding strategies for hyper-relational representations. However, current approaches consider only the global hypergraph structure or only the local semantic sequence structure. For instance, StarE (Galkin et al., 2020) employs the message-passing function of graph neural networks (GNNs) to unidirectionally pass auxiliary key-value pair information into the relations of the main triples, thereby capturing the graph structure but insufficiently modeling the interactions between multiple entities and relations within H-Facts. In contrast, GRAN (Wang et al., 2021) first incorporates the Transformer encoder (Vaswani et al., 2017) into HKG embedding, capturing the fully connected semantic information locally inside H-Facts while disregarding the global structure. Such inadequate representation of HKG structure constrains HKG embeddings, which makes representing the global and local structure of HKGs simultaneously with hierarchical attention a promising research direction.
To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE) that incorporates global-level and local-level attention. At the global level, we update the node embeddings using the HKG hypergraph structure. The previous hypergraph attention network (Bai et al., 2021), however, converts the hypergraph into a regular graph by fully connecting the nodes of each hyperedge and then applies GAT (Veličković et al., 2018) layers to update node embeddings, which leaves it unable to distinguish which nodes comprise a hyperedge. We therefore design hypergraph dual-attention layers that first aggregate node embeddings into hyperedge embeddings through an attention mechanism, and then update the node embeddings by feeding the hyperedge embeddings back to the nodes through a second attention mechanism. In this way, nodes can learn more distant information from the whole HKG, which enhances learning capacity. The updated node embeddings are then passed to the local-level attention. Inspired by GRAN's heterogeneous attention (Wang et al., 2021), we define five types of nodes and fourteen types of edges in a single H-Fact and develop heterogeneous self-attention layers with both node-bias and edge-bias attention to learn the semantic content of H-Facts. Finally, an MLP-based decoder outputs the link prediction results for one-position or multi-position link prediction tasks on HKGs.
Experiments on link prediction were performed on three standard HKG datasets: JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020). The state-of-the-art results indicate that HAHE is effective in the link prediction task. In addition, ablation experiments were designed to highlight the importance of global and local attention, and HAHE is also applied to the HKG multi-position prediction task, i.e., predicting two or more entities or relations simultaneously in a single H-Fact, which increases the applicability of the HKG link prediction task. Finally, we make our code publicly available and discuss the limitations and future work of HKG embedding representation.
Early approaches treat a hyper-relational fact (H-Fact) mainly as a graph structure, focusing more on the topological relations of entities. For example, m-TransH (Wen et al., 2016) projects the entities onto the relation hyperplane. RAE, NaLP, NeuInfer, and N-TuckER (Zhang et al., 2018; Guan et al., 2019, 2020; Liu et al., 2020) optimize hyper-relational knowledge graph (HKG) embedding based on m-TransH. However, none of them adopts the hyper-relational structure.
HINGE (Rosso et al., 2020) first proposes attribute-value qualifiers for embedding hyper-relational representations using a CNN, and Hyper2 (Yan et al., 2022) initializes relation and entity vectors in the Poincaré ball to improve model accuracy. Yet neither of these methods considers the graphical structure or the semantic sequences of HKGs.
StarE (Galkin et al., 2020) employs a GNN as the message-passing mechanism to encode entities and relations, with a Transformer as the decoder that produces the result. HyTransformer (Yu and Yang, 2021) applies layer normalization and dropout to replace StarE's encoder. MSeaHKG (Di and Chen, 2022) argues that the message-passing function significantly impacts model performance, so it replaces the static message-passing function in StarE with a dynamic one. These models consider the graph structure within hyper-relational facts but disregard the semantic sequences.
GRAN (Wang et al., 2021) builds on the Transformer (Vaswani et al., 2017). It replaces the Transformer's self-attention with edge-biased fully-connected attention and accurately captures semantic information. Despite this, it ignores the graph structure.
Unlike earlier models, HAHE considers graph structure and semantic sequences simultaneously and employs hierarchical attention for link prediction on HKGs. Moreover, it is the first to improve previous methods by modeling the structure of HKG via global-level embedding and can perform multi-position prediction.

Preliminaries
This section presents important concepts and techniques in Hyper-relational Knowledge Graph (HKG), including definitions of HKGs, hypergraph learning, global and local structure of HKGs, and multi-position prediction on HKGs.

Hyper-relational Knowledge Graphs
HKGs comprise hyper-relational facts (H-Facts). Typically, an H-Fact can be represented as $H = \{(s, r, o, \{(a_i : v_i)\}_{i=1}^{m}) \mid s, o, v_1, \ldots, v_m \in E,\ r, a_1, \ldots, a_m \in R\}$, where $(s, r, o)$ represents the main triple and $\{(a_i : v_i)\}_{i=1}^{m}$ represents $m$ auxiliary attribute-value qualifiers.
The link prediction (LP) task on HKGs is to predict missing elements of H-Facts, where a missing element can be an entity in $\{s, o, v_1, \ldots, v_m\}$ or a relation in $\{r, a_1, \ldots, a_m\}$.
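To make the definition concrete, the following is a minimal sketch of an H-Fact as a data structure, together with a one-position link prediction query. The names `HFact`, `to_sequence`, `make_query`, and the `[MASK]` token are hypothetical and chosen for illustration only; they are not taken from the HAHE code. The qualifier in the example fact is also made up for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token for the missing element


@dataclass
class HFact:
    """A hyper-relational fact: main triple (s, r, o) plus qualifiers (a_i : v_i)."""
    s: str
    r: str
    o: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)

    def to_sequence(self) -> List[Tuple[str, str]]:
        """Flatten to (role, element) pairs: s, r, o, a_1, v_1, ..., a_m, v_m."""
        seq = [("s", self.s), ("r", self.r), ("o", self.o)]
        for a, v in self.qualifiers:
            seq += [("a", a), ("v", v)]
        return seq


def make_query(fact: HFact, position: int):
    """Mask one position; return (masked sequence, answer, 'entity' or 'relation')."""
    seq = fact.to_sequence()
    role, answer = seq[position]
    masked = [(ro, MASK if i == position else el)
              for i, (ro, el) in enumerate(seq)]
    # r and a are relations; s, o and v are entities.
    kind = "relation" if role in ("r", "a") else "entity"
    return masked, answer, kind


# Illustrative fact; the qualifier is made up for the example.
fact = HFact("Barack Obama", "held position", "US president",
             [("start time", "2009")])
masked, answer, kind = make_query(fact, 2)  # hide the object o
```

Masking position 2 yields the query (s, r, ?, a_1, v_1) with an entity as the answer, while masking position 1 or 3 yields a relation prediction query.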

Hypergraph learning on HKGs
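Since there are more than two entities in an H-Fact, we introduce hypergraph learning (Feng et al., 2019) and represent the HKG as a hypergraph $G_H = \{E_H, H_H, I_H\}$, where $E_H$ is the set of nodes (entities), $H_H$ is the set of hyperedges (H-Facts), and $I_H$ is the incidence matrix. For $v \in E_H$ and $e \in H_H$, $h$ is a function giving the corresponding value in $I_H$: $h(v, e) = 1$ if node $v$ belongs to hyperedge $e$, and $h(v, e) = 0$ otherwise. For a node $v \in E_H$, its degree is defined as $d(v) = \sum_{e \in H_H} h(v, e)$, which represents the number of hyperedges (H-Facts) in the whole HKG in which the node (entity) appears.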
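The incidence function and the node degree can be illustrated with a small sketch. It assumes that $h(v, e)$ is a 0/1 indicator, as implied by the degree counting hyperedge memberships; the function names and the toy hypergraph are hypothetical.

```python
from collections import defaultdict


def incidence(hyperedges):
    """Sparse incidence function: h[(v, e)] = 1 if node v is in hyperedge e."""
    return {(v, e): 1 for e, nodes in enumerate(hyperedges) for v in set(nodes)}


def degree(h):
    """d(v) = sum over hyperedges e of h(v, e)."""
    d = defaultdict(int)
    for (v, _e), value in h.items():
        d[v] += value
    return dict(d)


# Toy hypergraph: each hyperedge lists the entities of one H-Fact.
hyperedges = [["A", "B", "C"], ["A", "D"], ["A", "B"]]
h = incidence(hyperedges)
d = degree(h)
```

Here entity A appears in all three hyperedges, so its degree is 3, while D appears in only one.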

Multi-position prediction on HKGs
Multi-position prediction is a new and meaningful task that is more practical than the one-position link prediction task on HKGs. In one-position link prediction on a main triple, we predict one element given the other two, i.e., (s, r, ?), (s, ?, o), or (?, r, o). In an auxiliary attribute-value qualifier, we predict either the attribute or the value given the other, i.e., (a, ?) or (?, v). In practical link prediction, however, some problems have two or more positions to predict, for example $(s, r, ?, a_1, ?, a_2, v_2)$ (one predicted position in the main triple and the other in the first attribute-value qualifier) or $(s, r, o, a_1, ?, a_2, ?)$ (both predicted positions in the auxiliary attribute-value qualifiers). We refer to link prediction tasks with two or more prediction positions as multi-position prediction tasks on HKGs.
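The two example queries above can be produced by masking several positions of the flattened fact at once. The helper below is a hypothetical illustration of how such queries are formed, not part of the HAHE implementation.

```python
MASK = "?"


def mask_positions(sequence, positions):
    """Replace the elements at the given positions with '?' and return
    the query together with the hidden answers (in position order)."""
    positions = set(positions)
    if not positions:
        raise ValueError("at least one position must be masked")
    query = [MASK if i in positions else x for i, x in enumerate(sequence)]
    answers = [sequence[i] for i in sorted(positions)]
    return query, answers


# Flattened H-Fact with two qualifiers: (s, r, o, a1, v1, a2, v2).
seq = ["s", "r", "o", "a1", "v1", "a2", "v2"]
q1, ans1 = mask_positions(seq, [2, 4])  # one position in the main triple
q2, ans2 = mask_positions(seq, [4, 6])  # both positions in qualifiers
```

A query with a single masked position reduces to the ordinary one-position task.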

Methodology
This section introduces our hyper-relational knowledge graph (HKG) embedding model HAHE, including the global and local representation, two hierarchical attention layers, and an MLP decoder.

Global and Local Representation
HKGs $G = \{E, R, H\}$ consist of multiple hyper-relational facts (H-Facts) $H$ with entities $E$ and relations $R$. Since each H-Fact represents an n-ary relation (n > 2) and has rich, heterogeneous semantic information, we model HKGs with a global-level hypergraph-based representation and a local-level sequence-based representation, using hypergraph dual-attention layers and heterogeneous self-attention layers respectively. The overview of HAHE is illustrated in Figure 3. For the global-level representation, we define $G_H = \{E_H, H_H, I_H\}$ to represent the graph structure of entities. Unlike in a regular graph, hyperedges can connect more than two entity nodes. Moreover, we use the incidence matrix $I_H$ to represent the association between nodes and hyperedges. For the local-level representation, every H-Fact has the structure of one main triple and several auxiliary attribute-value qualifiers, $\{(s, r, o, \{(a_i : v_i)\}_{i=1}^{m}) \mid s, o, v_1, \ldots, v_m \in E,\ r, a_1, \ldots, a_m \in R\}$, which represents the semantic information of facts. We can fully connect the entities and relations in an H-Fact and represent them as a heterogeneous semantic sequence structure containing five kinds of nodes, s, r, o, a, v, and 14 kinds of edges, s − r, ..., where i, j are the serial numbers of different qualifiers.

Hypergraph Dual-Attention Layers
As shown in Figure 4(a), entity embeddings first pass through hypergraph dual-attention layers to learn hypergraph structural information at the global level. Previous hypergraph attention network methods transform the hypergraph into an ordinary graph by fully connecting the nodes within the same hyperedge and then apply GAT. Hypergraph information is lost in this transformation because the resulting ordinary graph cannot distinguish whether two nodes are within the same hyperedge or in different hyperedges.
We first initialize the entities as nodes with embeddings $h_{v_i} \in \mathbb{R}^d$, where $v_i \in E_H$ and $d$ is the embedding dimension, and initialize the H-Facts as hyperedges with embeddings $h_{e_i} \in \mathbb{R}^d$, where $e_i \in H_H$. The node and hyperedge embeddings are then projected into the same space to obtain $W h_{v_i}$ and $W h_{e_i}$, where $W \in \mathbb{R}^{d \times d}$. The attention from nodes to hyperedges (N-to-H attention) computes a coefficient $\alpha_{ij}$ that indicates the importance of node $v_j$'s features to hyperedge $e_i$. The N-to-H attention function att is a single-layer neural network with a concatenation operation, and the attention is calculated only for $v_j \in e_i$, i.e., only where node $v_j$ is in hyperedge $e_i$, as indexed by the hypergraph incidence matrix $I_H$. The node information is then aggregated to the hyperedges, weighted by $\alpha_{ij}$ and with LR and $\sigma$ as activation functions, to give the updated hyperedge embeddings $\hat{h}_{e_i} \in \mathbb{R}^d$.
After that, the information of the updated hyperedges is aggregated back to the nodes in a similar way through hyperedge-to-node attention (H-to-N attention), where $\beta_{ij}$ denotes the importance of updated hyperedge $e_j$ to node $v_i$ and $\hat{h}_{v_i}$ is the updated node embedding. In this way, we update the global hypergraph entity embeddings and achieve node-hyperedge-node dual-attention message passing. Although we introduce more hyperedge embeddings than ordinary GNNs, they are only used as intermediate variables for the hypergraph attention computation.
The dual-attention layers are straightforward to implement with PyG (PyTorch Geometric), so they scale to large graphs in the same way as ordinary GNNs.
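The node-hyperedge-node message passing can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention scores follow the GAT pattern (a learnable vector `a` applied to the concatenation of the two projected embeddings, followed by LeakyReLU and a softmax over the members), tanh stands in for the unspecified activation σ, and the same `W` and `a` are reused for both directions. All function names are hypothetical.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(targets, sources, members, W, a):
    """Update each target by attending over its member sources only.

    members[i] lists the source indices connected to target i, which
    plays the role of indexing by the incidence matrix. Every target
    is assumed to have at least one member.
    """
    out = []
    for h_t, idx in zip(targets, members):
        Wt = matvec(W, h_t)
        Ws = [matvec(W, sources[j]) for j in idx]
        # Assumed att: a^T [W h_t || W h_s], then LeakyReLU (LR).
        scores = [leaky_relu(sum(ak * z for ak, z in zip(a, Wt + ws)))
                  for ws in Ws]
        weights = softmax(scores)
        agg = [sum(w * ws[k] for w, ws in zip(weights, Ws))
               for k in range(len(Wt))]
        out.append([math.tanh(z) for z in agg])  # tanh: assumed sigma
    return out


def dual_attention(node_emb, edge_emb, edge_members, W, a):
    """N-to-H attention followed by H-to-N attention."""
    new_edges = attend(edge_emb, node_emb, edge_members, W, a)
    node_members = [[e for e, mem in enumerate(edge_members) if v in mem]
                    for v in range(len(node_emb))]
    new_nodes = attend(node_emb, new_edges, node_members, W, a)
    return new_nodes, new_edges


W = [[1.0, 0.0], [0.0, 1.0]]   # d = 2, identity projection for the demo
a = [0.1, -0.2, 0.3, 0.4]      # attention vector of length 2d
node_emb = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
edge_emb = [[0.0, 0.0], [0.2, 0.2]]
edge_members = [[0, 1], [1, 2]]  # hyperedge 0 = {v0, v1}, hyperedge 1 = {v1, v2}
new_nodes, new_edges = dual_attention(node_emb, edge_emb, edge_members, W, a)
```

Because attention is restricted to `members`, two nodes exchange information only through a hyperedge they share, which is the property that is lost when a hypergraph is flattened into an ordinary graph.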
After the hypergraph dual-attention layers, the updated node embeddings are distributed to the sequence embeddings of the H-Facts together with the relation embeddings, and are fed into the heterogeneous self-attention layers.

Heterogeneous Self-Attention Layers
Each element $x_i \in \mathbb{R}^d$ in the sequence has one of five roles, s, r, o, a, v, and is connected to every other element $x_j \in \mathbb{R}^d$ in the sequence by one of 14 kinds of edges. So, as shown in Figure 4(b), we design a heterogeneous self-attention layer with both node-bias and edge-bias to learn the local semantic information of the H-Facts. The elements in the sequence pass through this attention layer and update their embeddings, where $\gamma_{ij}$ is the importance between one element $x_i$ in the sequence and another element $x_j$. The query, key, and value projections use linear weight matrices in $\mathbb{R}^{d \times d}$; the five kinds of $x_i$ pass through different weight matrices, indexed by a role function, which serves as the node-bias, and $b^Q_{ij}, b^K_{ij}, b^V_{ij} \in \mathbb{R}^d$ are designed as the edge-bias. The sequence embeddings updated by learning the semantic information inside the H-Facts are denoted $\hat{x}_i \in \mathbb{R}^d$.
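A minimal sketch of such a layer is given below. It is an assumption-laden illustration rather than the authors' formulation: the edge-biases are added to the projected query, key, and value vectors, the scores use scaled dot-product attention with a softmax over the sequence, and the role-specific matrices are stored in dictionaries keyed by role. The function names and the `bias` callback are hypothetical.

```python
import math

ROLES = ("s", "r", "o", "a", "v")


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(u, b):
    return [ui + bi for ui, bi in zip(u, b)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hetero_self_attention(xs, roles, WQ, WK, WV, bias):
    """One single-head heterogeneous self-attention pass.

    xs:    list of d-dimensional element embeddings x_i
    roles: role of each position, one of ROLES (node-bias selector)
    WQ, WK, WV: dict role -> d x d matrix (role-specific projections)
    bias:  callable (i, j) -> (bQ, bK, bV), the edge-bias vectors for
           the edge between positions i and j
    """
    d = len(xs[0])
    out = []
    for i, xi in enumerate(xs):
        scores, values = [], []
        for j, xj in enumerate(xs):
            bQ, bK, bV = bias(i, j)
            q = add(matvec(WQ[roles[i]], xi), bQ)
            k = add(matvec(WK[roles[j]], xj), bK)
            scores.append(sum(qt * kt for qt, kt in zip(q, k)) / math.sqrt(d))
            values.append(add(matvec(WV[roles[j]], xj), bV))
        gamma = softmax(scores)  # gamma_ij: importance of x_j to x_i
        out.append([sum(g * v[t] for g, v in zip(gamma, values))
                    for t in range(d)])
    return out


def zero_bias(i, j):
    """Edge-bias that is always zero (disables edge heterogeneity)."""
    return [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]


I2 = [[1.0, 0.0], [0.0, 1.0]]
W = {role: I2 for role in ROLES}  # identical matrices: no node heterogeneity
xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 1.0]]
roles = ["s", "r", "o", "a", "v"]
updated = hetero_self_attention(xs, roles, W, W, W, zero_bias)
```

With identical matrices for all roles and zero edge-bias, the layer reduces to ordinary self-attention, which corresponds to removing both forms of heterogeneity.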

MLP Decoder
Finally, the updated embedding at the position to be predicted, $\hat{x}_p$, is selected from the updated sequence embeddings and passed through the MLP decoder to get the prediction distribution and obtain the link prediction results. The distribution is obtained after a softmax operation and denotes the similarity probability of $\hat{x}_p$ with each element in the HKG, from which the link prediction answers are obtained.

Learning Strategy
The model is trained with a final loss calculated from the similarity between the prediction target and all entities, where $y_t$ is the $t$-th entry of the label $y$.
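As an illustration, the sketch below assumes that the loss is the cross-entropy between the softmax prediction distribution and the label $y$, which is a common choice for this kind of decoder; the exact form of the loss is not recoverable from the text here, so this should be read as an assumption. The function names and the toy scores are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Assumed loss: -sum_t y_t * log(p_t), with y_t the t-th label entry."""
    return -sum(y * math.log(p) for y, p in zip(label, probs) if y > 0)


logits = [2.0, 0.5, -1.0]   # toy decoder scores over three candidates
probs = softmax(logits)     # prediction distribution
label = [1.0, 0.0, 0.0]     # one-hot label: candidate 0 is the answer
loss = cross_entropy(probs, label)
```

Under this assumption, the loss decreases as the probability assigned to the correct candidate increases.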
For multi-position prediction, thanks to the fully connected attention mechanism, HAHE can accomplish the multi-position prediction task on HKGs by masking two or more entities or relations at two or more positions in the same hyper-relational fact. After passing the MLP decoder, HAHE obtains the prediction values $P_i$ ($i \geq 2$) for the corresponding positions and, for each position, takes the entity or relation with the highest similarity among all HKG elements as the prediction result.

Experiments
This section introduces the experimental settings, results, and analysis. We answer the following research questions (RQs). RQ1: Can HAHE outperform other hyper-relational knowledge graph (HKG) embedding models on HKG datasets? RQ2: How does the hierarchical attention mechanism contribute to HAHE? RQ3: How does the hypergraph dual-attention mechanism contribute to HAHE at the global level? RQ4: How does the heterogeneity of nodes and edges in hyper-relational facts contribute to HAHE at the local level? RQ5: How does HAHE perform in multi-position prediction tasks on HKGs?

Datasets
We conduct experiments on three hyper-relational datasets, JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020), as shown in Table 1. JF17K is extracted from Freebase (Bollacker et al., 2008). WikiPeople is derived from Wikidata (Vrandečić and Krötzsch, 2014) and concerns entities of type human; we use the version obtained by filtering out the statements containing literals in the original WikiPeople dataset. WD50K is a high-quality hyper-relational dataset with richer hyper-relational facts that carry auxiliary attribute-value qualifiers.

Baselines
We compare HAHE against a sizable collection of previous hyper-relational approaches.

Ablations
To evaluate the significance of HAHE's three main modules (the hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity), we obtain seven simplified model variants: six by removing any one or two modules from the full model (HAHE-node, HAHE-edge, HAHE-node&edge, HAHE-global, HAHE-global&node, HAHE-global&edge), and a basic variant by removing all three modules.

Evaluation Metrics
Each model predicts entities and relations separately. We split each prediction task into subject/object prediction in main triples, which tests the model's main-triple prediction ability, and prediction of all entities in whole H-Facts. MRR (the mean of reciprocal ranks) and Hits@K (the proportion of correct answers ranked in the top K) for K = 1, 10 are used to evaluate each link prediction task.
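Both metrics are computed from the rank of the correct answer in each test query. The sketch below shows the standard definitions; the function names and the toy ranks are hypothetical, and any filtering of other valid answers before ranking is left out.

```python
def rank_of(scores, target):
    """Rank of the target candidate: 1 + number of strictly better scores."""
    return 1 + sum(1 for s in scores if s > scores[target])


def mrr(ranks):
    """Mean reciprocal rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Proportion of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


ranks = [1, 2, 10, 50]          # toy ranks of the correct answers
example_mrr = mrr(ranks)        # (1 + 1/2 + 1/10 + 1/50) / 4
example_h1 = hits_at_k(ranks, 1)
example_h10 = hits_at_k(ranks, 10)
```

For these four toy queries, one answer is ranked first and three are ranked within the top ten.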

Hyperparameters and Environment
The model was trained for 300 epochs using the Adam optimizer with a batch size of 1024 on a single GeForce GTX 1080Ti for each dataset. Appendix A shows HAHE's optimal hyperparameter settings. Appendix B shows training details.

Main Results (RQ1)
In this experiment, we evaluate our model on the link prediction task. For entity prediction, the results of our model and its variants can be found in Table 2. For relation prediction, the results are shown in Appendix C. We observe that HAHE outperforms the other current methods on all three datasets. On JF17K, for subject and object prediction, HAHE reports an improvement of 0.6 (0.9%) MRR points, 1.5 (2.7%) H@1, and 3.6 (4.6%) H@10 over the next-best approach. For the prediction of all entities, HAHE reports a gain of 1.2 (1.8%) MRR, 1.5 (2.5%) H@1, and 1.7 (2.1%) H@10 over the next-best approach. On the other two datasets, we also observe improvements of varying degrees. On WD50K, the latest high-quality HKG dataset, our model has the largest improvement over the existing SOTA model GRAN, with an MRR improvement of about 10 points, which suggests that the model is better suited to hypergraph-structured knowledge graphs with hyper-relational facts beyond binary relations.

Ablation Study (RQ2)
The hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity are the three main components of HAHE. To determine whether each component is helpful, we evaluate seven different variants of HAHE. Each model variant was evaluated with a variety of hyperparameters, and the best prediction results were recorded. For the different HAHE variants in Figure 5(a), it can be observed that hypergraph dual-attention, node heterogeneity, and edge heterogeneity all contribute to the accuracy of our complete model. In addition, we list the specific results of three primary HAHE variants in Table 2, each of which lacks one of the components. This comparison demonstrates the effectiveness of HAHE's components.
We then conduct a more refined ablation analysis to explore the significance of the hypergraph dual-attention mechanism at the global level and of the heterogeneity of nodes and edges at the local level, respectively.

Analysis of hypergraph dual-attention mechanism in Global Level (RQ3)
To investigate the hypergraph dual-attention mechanism, we group the entity evaluation results on JF17K by the degree of entities in the HKG hypergraph structure. As shown in Figure 5(b), the hypergraph dual-attention mechanism improves the prediction accuracy for entities of different degrees. Owing to the attention mechanism, entity feature information contributes to the passing of global information through the hyper-relational facts (hyperedges). For entities with higher degree, the hypergraph dual-attention mechanism can better capture the global features of the hyper-relational facts. Entities with lower degree contribute little to capturing global information, but entities with higher degree compensate for this and improve their prediction accuracy.

Analysis of heterogeneity of nodes and edges in Local Level (RQ4)
The experimental results show that both node-bias and edge-bias with heterogeneity help distinguish different types of entities in an H-Fact and different types of relationships between them at the local level. Moreover, we find that node-bias plays a more prominent role than edge-bias in the subject/object prediction tasks, as shown in Figure 5(c), because the entity roles in the main triple are more diverse than those in the qualifiers. On the element prediction task in qualifiers, edge-bias is better than node-bias, as shown in Figure 5(d), because edge-bias, unlike node-bias, can distinguish the relationship between corresponding attribute-value pairs $(a_i, v_i)$ and non-corresponding attribute-value pairs $(a_j, v_i)$.

Results of Multi-position Prediction tasks on HKGs (RQ5)
We test multi-position prediction on JF17K, WikiPeople, and WD50K. As illustrated by the example in Figure 6, our model outputs the embedding for each position and calculates the joint probability distribution over candidate answer tuples. Because increasing the number of predicted positions leads to many more candidate answers than in one-position link prediction, we set an evaluation threshold and keep only the higher-scoring answer tuples to obtain the evaluation results. As shown in Table 3, we evaluated 2-position prediction and 3-position prediction on the three HKG datasets and used MRR to evaluate the ranking of the answer tuples. The test set has three categories: Ent-Ent, Ent-Rel, and Rel-Rel. Ent-Ent and Rel-Rel indicate that all predicted positions are entities or relations, respectively, and Ent-Rel indicates that entities and relations are missing jointly. In addition, the samples are divided into two categories according to whether the main triple is complete: "av" indicates that all predicted positions are in the auxiliary qualifiers, while "sro/av" indicates that one position is missing in the main triple and the others in the qualifiers; "all" includes both categories. According to the results, HAHE performs better on the high-quality WD50K dataset, and predicting more positions, or a position in the main triple, makes the prediction more difficult.
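The scoring of answer tuples with a threshold can be sketched as follows. The sketch assumes that the joint probability of a tuple is the product of the per-position probabilities and that tuples below the threshold are discarded; both the factorization and the function name are assumptions made for illustration, and the toy distributions are invented.

```python
from itertools import product


def top_answer_tuples(dists, threshold):
    """Score candidate answer tuples for several masked positions.

    dists: one dict per masked position, mapping candidate -> probability.
    Returns (tuple, joint probability) pairs with joint >= threshold,
    sorted from most to least probable.
    """
    results = []
    for combo in product(*(d.items() for d in dists)):
        joint = 1.0
        for _candidate, p in combo:
            joint *= p  # assumed: positions are scored independently
        if joint >= threshold:
            results.append((tuple(c for c, _ in combo), joint))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Toy distributions for a 2-position query.
pos1 = {"A": 0.7, "B": 0.3}
pos2 = {"x": 0.6, "y": 0.4}
kept = top_answer_tuples([pos1, pos2], threshold=0.2)
```

The number of candidate tuples grows multiplicatively with the number of masked positions, which is why a threshold on the joint score is useful in practice.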

Conclusion
In this paper, we present HAHE, a model with hierarchical attention at the global and local levels. HAHE outperforms other baselines on link prediction tasks on hyper-relational knowledge graphs (HKGs). The experimental results demonstrate that our hypergraph dual-attention layers and heterogeneous self-attention layers are effective in learning the global and local structure of HKGs. We also use HAHE to solve the HKG multi-position prediction task and analyze the results for the first time. graduate Innovation and Entrepreneurship Project led by Haoran Luo.

Limitations
For HKG one-position link prediction tasks, HAHE shows the best performance on all three datasets. However, because HAHE is based on hypergraph learning, it improves more on WD50K, a high-quality hyper-relational knowledge graph link prediction dataset, and less on the WikiPeople dataset, where triples are the majority; that is, HAHE favors facts with higher arity. In the future, we will consider extending our approach to triples within a unified architecture.
For HKG multi-position link prediction tasks, our model is effective when predicting multiple pieces of missing auxiliary information, which is a frequent situation in practical applications. However, the prediction accuracy of our model needs to be further improved in the case of missing primary relations.

Ethics Statement
This paper investigates the problem of knowledge graph link prediction, aiming at complementing incomplete hyper-relational knowledge graphs using deep learning methods to better promote knowledge graphs for assisted decision-making and intelligent question-and-answer applications. Therefore, we believe it does not violate any ethics.
Table 5: Comparison of HAHE with other models on relation prediction accuracy on JF17K, WikiPeople and WD50K. Results of the models are mainly taken from the original papers. Best results in each task are in bold.