Link Prediction on N-ary Relational Facts: A Graph-based Approach

Link prediction on knowledge graphs (KGs) is a key research topic. Previous work mainly focused on binary relations, paying less attention to higher-arity relations although they are ubiquitous in real-world KGs. This paper considers link prediction upon n-ary relational facts and proposes a graph-based approach to this task. The key to our approach is to represent the n-ary structure of a fact as a small heterogeneous graph, and model this graph with edge-biased fully-connected attention. The fully-connected attention captures universal inter-vertex interactions, while edge-aware attentive biases are introduced to specifically encode the graph structure and its heterogeneity. In this fashion, our approach fully models global and local dependencies in each n-ary fact, and hence can more effectively capture associations therein. Extensive evaluation verifies the effectiveness and superiority of our approach. It performs substantially and consistently better than current state-of-the-art across a variety of n-ary relational benchmarks. Our code is publicly available.


Introduction
Web-scale knowledge graphs (KGs), such as Freebase (Bollacker et al., 2008), Wikidata (Vrandečić and Krötzsch, 2014), and Google Knowledge Vault (Dong et al., 2014), are useful resources for many real-world applications, ranging from Web search and question answering to recommender systems. Though impressively large, these modern KGs are still known to be greatly incomplete and missing crucial facts (West et al., 2014). Link prediction which predicts missing links in KGs has therefore become an important research topic.
Previous studies mainly consider link prediction upon binary relational facts, which encode binary relations between pairs of entities and are usually represented as (subject, relation, object) triples. Nevertheless, besides binary relational facts, n-ary relational facts that involve more than two entities are also ubiquitous in reality, e.g., Marie Curie received Nobel Prize in Physics in 1903 together with Pierre Curie and Antoine Henri Becquerel is a typical 5-ary fact. As pointed out by Wen et al. (2016), more than 1/3 of the entities in Freebase actually participate in n-ary relational facts.
1 https://github.com/PaddlePaddle/Research/tree/master/KG/ACL2021_GRAN
Despite their ubiquity, only a few studies have examined link prediction on n-ary relational facts. In these studies, an n-ary fact is typically represented as a set of peer attributes (relations) along with their values (entities), e.g., {person: Marie Curie, award: Nobel Prize in Physics, point-in-time: 1903, together-with: Pierre Curie, together-with: Antoine Henri Becquerel}. Link prediction then is achieved by learning the relatedness either between the values (Zhang et al., 2018; Fatemi et al., 2020) or between the attribute-value pairs (Guan et al., 2019; Liu et al., 2021). This representation inherently assumes that attributes of the same n-ary fact are equally important, which is usually not the case. To further discriminate the importance of different attributes, Rosso et al. (2020) and Guan et al. (2020) later proposed to represent an n-ary fact as a primary triple coupled with auxiliary attribute-value descriptions, e.g., in the above 5-ary fact, (Marie Curie, award-received, Nobel Prize in Physics) is the primary triple and point-in-time: 1903, together-with: Pierre Curie, together-with: Antoine Henri Becquerel are auxiliary descriptions. Link prediction then is achieved by measuring the validity of the primary triple and its compatibility with each attribute-value pair. These attribute-value pairs, however, are modeled independently before a final aggregation, thus ignoring intrinsic semantic relatedness in between.
This work in general follows Rosso et al. (2020) and Guan et al. (2020)'s expressive representation form of n-ary facts, but takes a novel graph learning perspective for modeling and reasoning with such facts. Given an n-ary fact represented as a primary subject-relation-object triple (s, r, o) with auxiliary attribute-value pairs {(a_i : v_i)}, we first formalize the fact as a heterogeneous graph. This graph, as we illustrate in Figure 1, takes relations and entities (attributes and values) as vertices, and introduces four types of edges, i.e., subject-relation, object-relation, relation-attribute, and attribute-value, to denote distinct connectivity patterns between these vertices. In this fashion, the full semantics of the given fact will be retained in the graph. Then, based on this graph representation, we employ a fully-connected attention module to characterize inter-vertex interactions, while further introducing edge-aware attentive biases to particularly handle the graph structure and heterogeneity. This enables us to capture not only local but also global dependencies within the fact. Our approach directly encodes each n-ary fact as a whole graph so as to better capture rich associations therein. In this sense, we call it GRAph-based N-ary relational learning (GRAN).
The most similar prior art to this work is STARE (Galkin et al., 2020), which uses a message passing based graph encoder to obtain relation (attribute) and entity (value) embeddings, and feeds these embeddings into a Transformer (Vaswani et al., 2017) decoder to score n-ary facts. Our approach is more neatly designed by (1) excluding the computation-heavy graph encoder which, according to a contemporaneous study (Yu and Yang, 2021), may not be necessary given an expressive enough decoder, and (2) modeling the full n-ary structure of a fact during decoding, which enables it to capture not only global but also local dependencies therein.
We evaluate our approach on a variety of n-ary link prediction benchmarks. Experimental results reveal that GRAN works particularly well in learning and reasoning with n-ary relational facts, consistently and substantially outperforming current state-of-the-art across all the benchmarks. Our main contributions are summarized as follows:
• We present a novel graph-based approach to learning and reasoning with n-ary facts, capable of capturing rich associations therein.
• We demonstrate the effectiveness and superiority of our approach, establishing new state-of-the-art across a variety of benchmarks.

Problem statement
This section formally defines n-ary relational facts and the link prediction task on this kind of data.
Definition 1 (N-ary relational fact) An n-ary relational fact F is a primary subject-relation-object triple (s, r, o) coupled with m auxiliary attribute-value pairs {(a_i : v_i)}_{i=1}^m, where r, a_1, ..., a_m ∈ R and s, o, v_1, ..., v_m ∈ E, with R and E being the sets of relations and entities, respectively. We slightly abuse terminology here by referring to the primary relation and all attributes as relations, and to the subject, object, and values as entities unless otherwise specified. The arity of the fact is (m + 2), i.e., the number of entities in the fact.
Definition 2 (N-ary link prediction) N-ary link prediction aims to predict a missing element from an n-ary fact. The missing element can be either an entity ∈ {s, o, v_1, ..., v_m} or a relation ∈ {r, a_1, ..., a_m}, e.g., to predict the primary subject of the incomplete n-ary fact (?, r, o).


Graph-based n-ary relational learning
This section presents GRAN, our graph-based approach to n-ary link prediction. There are two key factors of our approach: graph representation and graph learning. The former represents n-ary facts as graphs, and the latter learns with these graphs to perform inference on n-ary facts.

Graph representation
We elaborate the first key factor: graph representation of n-ary facts. Given an n-ary fact defined as in Definition 1, we reformulate it equivalently as a heterogeneous graph G = (V, L). The vertex set V consists of all entities and relations in the fact, i.e., V = {r, s, o, a_1, ..., a_m, v_1, ..., v_m}. The link set L consists of (2m + 2) undirected edges of four types between the vertices, i.e.,
• 1 subject-relation edge (s, r),
• 1 object-relation edge (o, r),
• m relation-attribute edges (r, a_i) for i = 1, ..., m,
• m attribute-value edges (a_i, v_i) for i = 1, ..., m.
The graph heterogeneity is reflected in that the vertices and links are both typed, with type mapping functions φ : V → {entity, relation} and ψ : L → {subject-relation, object-relation, relation-attribute, attribute-value}, respectively. Figure 1 provides a visual illustration of this heterogeneous graph.
As we can see, the graph representation retains the full semantics of a given fact. It also enables us to model the fact as a whole and capture all possible interactions therein, which, as we will show later in our experiments, is crucial for learning with n-ary relational facts.
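To make the construction concrete, the following is a minimal Python sketch that turns a fact in the primary-triple-plus-descriptions form into the vertex and typed-edge lists described above. The function name and the integer encodings are our illustrative choices, not part of the released implementation.

```python
# Integer codes for the 4 edge types of the heterogeneous graph.
SUBJECT_RELATION, OBJECT_RELATION, RELATION_ATTRIBUTE, ATTRIBUTE_VALUE = range(4)

def fact_to_graph(s, r, o, aux_pairs):
    """Turn an n-ary fact (s, r, o) + [(a_i, v_i), ...] into vertex and
    typed-edge lists. Vertices are (name, vertex_type) pairs; edges are
    (i, j, edge_type) triples over vertex indices."""
    vertices = [(r, "relation"), (s, "entity"), (o, "entity")]
    edges = [(1, 0, SUBJECT_RELATION), (2, 0, OBJECT_RELATION)]
    for a, v in aux_pairs:
        ai = len(vertices); vertices.append((a, "relation"))
        vi = len(vertices); vertices.append((v, "entity"))
        edges.append((ai, 0, RELATION_ATTRIBUTE))  # attribute -- primary relation
        edges.append((vi, ai, ATTRIBUTE_VALUE))    # value -- its attribute
    return vertices, edges

# The 5-ary running example: m = 3 auxiliary pairs,
# hence 2m + 3 = 9 vertices and 2m + 2 = 8 edges.
fact = ("Marie Curie", "award-received", "Nobel Prize in Physics")
aux = [("point-in-time", "1903"),
       ("together-with", "Pierre Curie"),
       ("together-with", "Antoine Henri Becquerel")]
V, L = fact_to_graph(*fact, aux)
```

Note that the attribute of each auxiliary description is linked to the primary relation, and its value to the attribute, which is exactly the four-edge-type pattern of Figure 1.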

Graph learning
The second key factor is learning with heterogeneous graphs to perform inference on n-ary facts. Given an incomplete n-ary fact with a missing element, say (?, r, o) along with its auxiliary attribute-value pairs, represented as a heterogeneous graph, we feed the graph into an embedding layer, a stack of L successive graph attention layers, and a final prediction layer to predict the missing element, say s. This whole process is sketched in Figure 2 (left).
The input embedding layer maps the elements of the input n-ary fact or, equivalently, the vertices of the input graph, to their continuous vector representations (the missing element is denoted by a special token [MASK]). The L graph attention layers then repeatedly encode the graph and update its vertex representations. Our graph attention generally inherits from Transformer (Vaswani et al., 2017) and its fully-connected attention which captures universal inter-vertex associations, but further introduces edge-aware attentive biases to particularly handle graph structure and heterogeneity. As such, we call it edge-biased fully-connected attention. After the graph encoding process, we use the representation of the special token [MASK] to predict the missing element. In the rest of this section, we emphasize the edge-biased fully-connected attention, and refer readers to (Vaswani et al., 2017) and Appendix A for other modules of our graph attention layer.
Edge-biased fully-connected attention We are given an input graph G = (V, L), with vertex type mapping function φ and link type mapping function ψ. Vertices are associated with hidden states x_1, ..., x_|V| ∈ R^{d_x} generated by previous layers. The aim of this attention is to aggregate information from different vertices and update vertex representations, taking into account the graph structure and its heterogeneity. We employ multi-head attention with H heads, each applied independently to the input x_1, ..., x_|V| ∈ R^{d_x} to generate updated vertex representations z^h_1, ..., z^h_|V| ∈ R^{d_z} for h = 1, ..., H. These updated vertex representations are concatenated and linearly transformed to generate the final attention output. We set d_x = d and d_z = d/H for all layers and heads. Below we describe the specific design of each head, dropping the head index h for notational brevity.
Our attention follows the traditional query-key-value attention (Vaswani et al., 2017). Specifically, for each input x_i, we project it into a triple of query, key, and value as (q_i, k_i, v_i) = (W^Q x_i, W^K x_i, W^V x_i), with projection matrices W^Q, W^K, W^V ∈ R^{d_z×d_x}, respectively. Then we measure the similarity between each pair of vertices, say i and j, as a scaled dot product of i's query and j's edge-biased key:

α_ij = q_i^T (k_j + e^K_ij) / √d_z.   (1)

After we obtain the similarity scores α_ij, a softmax operation is applied, and the edge-biased values are aggregated accordingly to generate the updated representation for each vertex i:

z_i = Σ_j [ exp(α_ij) / Σ_{j'} exp(α_{ij'}) ] (v_j + e^V_ij).   (2)

We call this attention fully-connected as it takes into account the similarity between any two vertices i and j. We call it edge-biased as it further introduces attentive biases e^K_ij, e^V_ij ∈ R^{d_z} to encode the typed edge between i and j, one to generate the edge-biased key (cf. Eq. (1)) and the other the edge-biased value (cf. Eq. (2)). Introducing e^K_ij enables our attention to encode not only global dependencies that universally exist between any pair of vertices, but also local dependencies that are particularly indicated by typed edges. Introducing e^V_ij further propagates edge information to the attention output. If there is no edge linking i and j, we set e^K_ij = e^V_ij = 0, in which case the attention degenerates to the conventional fully-connected attention used in Transformer (Vaswani et al., 2017). As the attentive biases e^K_ij and e^V_ij can be designed freely to meet any desired specifications, this attention is in essence quite flexible, capable of modeling arbitrary relationships between the input elements. This idea has actually been applied, e.g., to model relative positions between words within sentences (Shaw et al., 2018; Wang et al., 2019a), or to model various kinds of mention dependencies for relation extraction (Xu et al., 2021).
Edge-aware attentive biases We now elaborate how e^K_ij and e^V_ij are specifically designed for n-ary facts. Recall that given an n-ary fact represented as a heterogeneous graph G = (V, L), there are 4 distinct types of edges in the graph: subject-relation, object-relation, relation-attribute, and attribute-value. To each type we assign a pair of key and value biases. The attentive biases between vertices i and j are then defined as the biases associated with the type of the edge linking i and j:

(e^K_ij, e^V_ij) = (e^K_k, e^V_k) with k = ψ((i, j)) if (i, j) ∈ L, and (0, 0) otherwise.   (3)

Here e^K_k, e^V_k ∈ R^{d_z} for k = 1, 2, 3, 4 are the key and value biases corresponding to the 4 edge types, shared across all layers and heads. In this way, the graph structure (whether there is an edge between two vertices) and its heterogeneity (which type the edge between two vertices is) can be well encoded into the attentive biases, and then propagated to the final attention output. Figure 2 (right) visualizes the edge-biased attention between pairs of vertices in an n-ary fact.
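A single edge-biased attention head can be sketched in NumPy roughly as follows. This is a hedged illustration with randomly initialized projections; the function name and the "-1 means no edge" encoding are our choices, not the released implementation.

```python
import numpy as np

def edge_biased_attention(X, edge_type, d_z, seed=0):
    """One edge-biased fully-connected attention head.

    X:         (n, d_x) vertex hidden states.
    edge_type: (n, n) integer matrix with values in {-1, 0, 1, 2, 3};
               -1 means "no edge", whose biases degenerate to zero.
    """
    rng = np.random.default_rng(seed)
    n, d_x = X.shape
    Wq, Wk, Wv = (rng.normal(size=(d_x, d_z)) * d_x ** -0.5 for _ in range(3))
    # Rows 1..4 hold the key/value biases for the 4 edge types;
    # row 0 is the zero bias used when two vertices are not linked.
    eK = np.vstack([np.zeros((1, d_z)), rng.normal(size=(4, d_z))])
    eV = np.vstack([np.zeros((1, d_z)), rng.normal(size=(4, d_z))])
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    bK, bV = eK[edge_type + 1], eV[edge_type + 1]            # (n, n, d_z)
    # alpha_ij = q_i . (k_j + e^K_ij) / sqrt(d_z)
    alpha = np.einsum("id,ijd->ij", Q, K[None, :, :] + bK) / np.sqrt(d_z)
    A = np.exp(alpha - alpha.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    # z_i = sum_j softmax(alpha)_ij (v_j + e^V_ij)
    return np.einsum("ij,ijd->id", A, V[None, :, :] + bV)
```

Setting every entry of edge_type to -1 recovers the conventional fully-connected attention of Transformer, which is exactly the degenerate case discussed above.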

Model training
We directly use n-ary link prediction as our training task. Specifically, given an n-ary fact F = (s, r, o, {(a_i : v_i)}_{i=1}^m) in the training set, we create (2m + 3) training instances for it, each predicting a missing element (either an entity or a relation) given the other elements in the fact, e.g., (?, r, o) with the auxiliary pairs predicts the primary subject, to which the answer is s. Here and in what follows we denote a training instance as F, with the missing element indicated by a special token [MASK]. This training instance is reformulated as a heterogeneous graph G with vertices (x_1, ..., x_k), where k = 2m + 3 is the total number of vertices therein. The label is denoted as y. We have y ∈ E for entity prediction and y ∈ R for relation prediction.
Each training instance F or, equivalently, the corresponding graph G is fed into the embedding, graph attention, and final prediction layers to predict the missing element, as introduced above. Suppose after the successive graph attention layers we obtain for the vertices (x_1, ..., x_k) their hidden states (h_1, ..., h_k) ∈ R^d. The hidden state corresponding to [MASK], denoted as h for brevity, is used for the final prediction. The prediction layer is constructed by two linear transformations followed by a standard softmax operation:

p = softmax(W_2 (W_1 h + b_1) + b_2).

Here, W_2 is shared with the weight matrix of the input embedding layer, and W_1, b_1, b_2 are freely learnable. The final output p is a probability distribution over entities in E or relations in R, depending on the type of the missing element. We use the cross-entropy between the prediction and the label as our training loss:

L = -Σ_t y_t log p_t,

where p_t is the t-th entry of the prediction p, and y_t the t-th entry of the label y. As a one-hot label restricts each prediction task to a single answer, which might not be the case in practice, we employ label smoothing to lessen this restriction. Specifically, for entity prediction, we set y_t = 1 - ε^(e) for the target entity and y_t = ε^(e) / (|E| - 1) for each of the other entities, where ε^(e) is a small entity label smoothing rate. For relation prediction y_t is set in a similar way, with relation label smoothing rate ε^(r). The loss is minimized using the Adam optimizer (Kingma and Ba, 2015). We use learning rate warmup over the first 10% of training steps and linear decay of the learning rate thereafter. We also use batch normalization and dropout after each layer and sub-layer to regularize, stabilize, and speed up training.
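The smoothed targets and the cross-entropy loss above can be illustrated with a small sketch (helper names are ours; ε stands for the smoothing rate ε^(e) or ε^(r)):

```python
import numpy as np

def smoothed_target(num_candidates, answer_index, eps):
    """One-hot label over |E| entities (or |R| relations), softened with
    label smoothing rate eps: the answer gets 1 - eps, every other
    candidate gets eps / (num_candidates - 1)."""
    y = np.full(num_candidates, eps / (num_candidates - 1))
    y[answer_index] = 1.0 - eps
    return y

def cross_entropy(p, y):
    """L = -sum_t y_t log p_t."""
    return -np.sum(y * np.log(p))
```

By construction the smoothed label still sums to 1, so it remains a valid target distribution for the softmax output p.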
Unlike previous methods which score individual facts and learn from positive-negative pairs (Rosso et al., 2020; Guan et al., 2020), our training scheme bears two advantages: (1) Directly using n-ary link prediction as the training task can effectively avoid training-test discrepancy.
(2) Introducing a special token [MASK] enables us to score a target element against all candidates simultaneously, which accelerates convergence during training and speeds up evaluation drastically (Dettmers et al., 2018).
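Generating the (2m + 3) masked training instances for a single fact can be sketched as follows (names are illustrative):

```python
MASK = "[MASK]"

def training_instances(s, r, o, aux_pairs):
    """Yield the (2m + 3) masked instances for one fact: each instance is
    (masked_fact, answer), where masked_fact has exactly one element
    replaced by the special token [MASK]."""
    elems = [s, r, o] + [x for pair in aux_pairs for x in pair]
    for i, answer in enumerate(elems):
        masked = list(elems)
        masked[i] = MASK
        yield tuple(masked), answer
```

Each instance then becomes one heterogeneous graph, and the model scores the [MASK] position against all candidate entities or relations at once.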

Experiments and results
We evaluate GRAN in the link prediction task on n-ary facts. This section presents our experiments and results.

Datasets
We consider standard n-ary link prediction benchmarks including: JF17K (Zhang et al., 2018) is collected from Freebase. On this dataset, an n-ary relation is predefined by a set of attributes, and facts of this relation should have all corresponding values completely given. Take music.group membership as an example: all facts of this relation should have three values w.r.t. the predefined attributes, e.g., (Guitar, Dean Fertita, Queens of the Stone Age). The maximum arity of the relations is 6.
WikiPeople (Guan et al., 2019) is derived from Wikidata, concerning entities of type human. On this dataset, n-ary facts are already represented as primary triples with auxiliary attribute-value pairs, a form that is more tolerant to data incompleteness. The maximum arity is 9. As the original dataset also contains literals, we follow (Rosso et al., 2020; Galkin et al., 2020) and consider another version that filters out statements containing literals. This filtered version is referred to as WikiPeople−, and its maximum arity is 7.
For JF17K and its subsets, we transform the representation of an n-ary fact to a primary triple coupled with auxiliary attribute-value pairs. We follow (Rosso et al., 2020; Galkin et al., 2020) and directly take the values corresponding to the first and second attributes as the primary subject and object, respectively. Other attributes and values are taken as auxiliary descriptions. Facts on each dataset are split into train/dev/test sets, and we use the original split. On JF17K, which provides no dev set, we split 20% of the train set for development. The statistics of these datasets are summarized in Table 1.
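This transformation can be sketched as follows. It is a hedged illustration: the attribute_names argument is our hypothetical stand-in for JF17K's per-relation attribute ordering.

```python
def tuple_to_fact(relation, values, attribute_names):
    """JF17K-style (relation, v_1..v_n) tuple -> primary triple plus
    auxiliary attribute-value pairs: values of the first and second
    attributes become subject and object, the rest become auxiliary
    descriptions."""
    s, o = values[0], values[1]
    aux = list(zip(attribute_names[2:], values[2:]))
    return (s, relation, o), aux
```

For a 3-ary fact this yields one auxiliary pair; for a binary fact the auxiliary list is simply empty.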

Baseline methods
We compare against the following state-of-the-art n-ary link prediction techniques: RAE (Zhang et al., 2018) represents an n-ary fact as an (n + 1)-tuple consisting of the predefined relation and its n values. It generalizes the binary link prediction method TransH (Wang et al., 2014) to the higher-arity case, measuring the validity of a fact as the compatibility between its n values.
NaLP (Guan et al., 2019) and RAM (Liu et al., 2021) represent an n-ary fact as a set of attribute-value pairs. NaLP employs a convolutional neural network followed by fully connected neural nets to model the relatedness of such attribute-value pairs and accordingly measure the validity of a fact. RAM further models the relatedness between different attributes, as well as that between an attribute and all involved values.
HINGE (Rosso et al., 2020) and NeuInfer (Guan et al., 2020) regard an n-ary fact as a primary triple with auxiliary attribute-value pairs. They deploy neural modules to measure the validity of the primary triple and its compatibility with each auxiliary description, and combine these modules to obtain the overall score of a fact. As different auxiliary descriptions are modeled independently before aggregation, these two methods show limited ability to model full associations within n-ary facts.
STARE (Galkin et al., 2020) is a recently proposed method generalizing graph convolutional networks (Kipf and Welling, 2017) to n-ary relational KGs. It employs a message passing based graph encoder to obtain entity/relation embeddings, and feeds these embeddings into a Transformer decoder to score n-ary facts. Hy-Transformer (Yu and Yang, 2021) replaces the graph encoder with lightweight embedding processing modules, achieving higher efficiency without sacrificing effectiveness. Both methods employ vanilla Transformer decoders, ignoring specific n-ary structures during decoding.
n-CP, n-TuckER, and GETD (Liu et al., 2020) are tensor factorization approaches to n-ary link prediction. They all follow RAE and represent each n-ary fact as an (n + 1)-tuple. A whole KG can thus be represented as a binary valued (n + 1)-way tensor X ∈ {0, 1}^{|R|×|E|×···×|E|}, where an entry x = 1 means the corresponding fact is true and x = 0 otherwise. X is then decomposed and approximated by a low-rank tensor X̂ that estimates the validity of all facts. Different tensor decomposition strategies can be applied, e.g., n-CP generalizes CP decomposition (Kruskal, 1977) and n-TuckER is built on TuckER (Balazevic et al., 2019). As the tensor representation inherently requires all facts to have the same arity, these methods are not applicable to datasets of mixed arities, e.g., JF17K and WikiPeople.

GRAN variants
We evaluate three variants of GRAN to investigate the impact of modeling graph structure and heterogeneity, including: GRAN-hete is the full model introduced above. It uses edge representations defined in Eq. (3), which encode both graph structure (whether there is an edge) and heterogeneity (which type the edge is).
GRAN-homo retains graph structure but ignores heterogeneity. There are only two groups of edge attentive biases: (e^K_ij, e^V_ij) = (0, 0) or (e^K_ij, e^V_ij) = (e^K, e^V). The former is used if there is no edge between vertices i and j, while the latter is employed whenever the two vertices are linked, irrespective of the type of the edge between them. This in essence views an n-ary fact as a homogeneous graph where all edges are of the same type.
GRAN-complete considers neither graph structure nor heterogeneity. It simply sets (e^K_ij, e^V_ij) = (0, 0) for all vertex pairs. The edge-biased attention thus degenerates to the conventional one used in Transformer, which captures only global dependencies between vertices. This in essence regards an n-ary fact as a complete graph in which any two vertices are connected by an (untyped) edge. STARE and Hy-Transformer are most similar to this variant.
We use the following configurations for all variants of GRAN: L = 12 graph attention layers, H = 4 attention heads, hidden size d = 256, batch size b = 1024, and learning rate η = 5e−4, fixed across all the datasets. Besides, on each dataset, we tune the entity/relation label smoothing rates ε^(e)/ε^(r), dropout rate ρ, and number of training epochs τ in their respective ranges. The optimal configuration is determined by dev MRR. We leave the tuning ranges and optimal values of these hyperparameters to Appendix B. After determining the optimal configuration on each dataset, we train with a combination of the train and dev splits and evaluate on the test split, as practiced in (Galkin et al., 2020).

Evaluation protocol and metrics
During evaluation, we distinguish between entity prediction and relation prediction. Take entity prediction as an example. For each test n-ary fact, we replace one of its entities (i.e., subject, object, or an auxiliary value) with the special token [MASK], feed the masked graph into GRAN, and obtain a predicted distribution of the answer over all entities ∈ E. Then we sort the probabilities in descending order and get the rank of the correct answer. During ranking, we ignore facts that already exist in the train, dev, or test split. We repeat this procedure for all specified entities in the test fact, and report MRR and Hits@k for k = 1, 10 aggregated on the test split. MRR is the average of reciprocal ranks, and Hits@k is the proportion of ranks no larger than k (abbreviated as H@k). The same evaluation protocol and metrics also apply to relation prediction, where a relation can be either the primary relation or an auxiliary attribute.
[Table 2 caption: baseline scores are taken from Guan et al. (2019, 2020), and those for predicting the primary relation from Rosso et al. (2020). Best scores are highlighted in bold, and "-" denotes missing scores.]
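The filtered ranking and the MRR/Hits@k aggregation can be sketched as follows (illustrative names; known_true stands for the candidate indices that would complete the query into facts already present in the train/dev/test splits):

```python
import numpy as np

def filtered_rank(scores, answer, known_true):
    """Rank of the correct answer under the filtered setting: candidates
    that complete the query into another known-true fact are ignored
    (their scores are discarded), except the answer itself."""
    scores = np.asarray(scores, dtype=float).copy()
    for idx in known_true:
        if idx != answer:
            scores[idx] = -np.inf
    # rank = 1 + number of candidates scored strictly higher than the answer
    return 1 + int((scores > scores[answer]).sum())

def mrr_and_hits(ranks, k=(1, 10)):
    """MRR = mean reciprocal rank; Hits@k = fraction of ranks <= k."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float((1.0 / ranks).mean())
    hits = {kk: float((ranks <= kk).mean()) for kk in k}
    return mrr, hits
```

Discarding other known-true candidates before ranking is what the paragraph above means by "we ignore facts that already exist in the train, dev, or test split".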

Results on datasets of mixed arities
Table 2 presents entity prediction results on the datasets of mixed arities under two settings: (1) predicting all entities s, o, v_1, ..., v_m in an n-ary fact and (2) predicting only the subject s and object o. This enables a direct comparison to previous literature (Guan et al., 2020; Rosso et al., 2020; Galkin et al., 2020). From the results, we can see that (1) the optimal setting of our approach offers consistent and substantial improvements over all the baselines across all the datasets in almost all metrics, showing its significant effectiveness and superiority in entity prediction within n-ary facts.
(2) All the variants, including the less expressive GRAN-homo and GRAN-complete, perform quite well, greatly surpassing the competitive baselines in almost all cases except for the WikiPeople − dataset. This verifies the superior effectiveness of modeling n-ary facts as whole graphs so as to capture global dependencies between all relations and entities therein.
(3) Among the variants, GRAN-hete offers the best performance. This demonstrates the necessity and superiority of further modeling specific graph structures and graph heterogeneity, so as to capture local dependencies reflected by typed edges linking relations and entities. Table 3 further shows relation prediction results on these datasets. Again, to make a direct comparison with previous literature, we consider two settings: (1) predicting all relations including the primary relation r and auxiliary attributes a_1, ..., a_m and (2) predicting only the primary relation r. Here, on each dataset, GRAN models are fixed to their respective optimal configurations (see Appendix B) determined in the entity prediction task. The results show that GRAN variants perform particularly well in relation prediction. Among these variants, GRAN-hete performs the best, consistently outperforming the baselines and achieving extremely high performance across all the datasets. This is because relation prediction is, by nature, a relatively easy task due to the small number of candidate answers.
Table 4 presents entity prediction results on the four subsets of JF17K and WikiPeople, which consist solely of 3-ary or 4-ary facts. Here, an entity means either the subject, the object, or an attribute value. On these four single-arity subsets, tensor factorization based approaches like n-CP, n-TuckER, and GETD apply quite well and have reported promising performance. From the results, we observe similar phenomena as from Table 2. The GRAN variants perform particularly well, all surpassing or at least performing on par with the baselines across the datasets. And GRAN-hete, again, offers the best performance in general among the three variants.

Further analysis
We further look into the breakdown entity prediction performance of the GRAN variants on different arities. More specifically, we group the test split of each dataset into binary (n = 2) and n-ary (n > 2) categories. Entity prediction means predicting the subject/object for the binary category, or predicting an attribute value in addition for the n-ary category. Table 5 presents the breakdown MRR scores in all these different cases on JF17K, WikiPeople, and WikiPeople − , with the GRAN variants set to their respective optimal configurations on each dataset (see Appendix B). Among the variants GRAN-hete performs best in all cases, which again verifies the necessity and superiority of modeling n-ary facts as heterogeneous graphs. Ignoring the graph heterogeneity (GRAN-homo) or further graph structures (GRAN-complete) always leads to worse performance, particularly when predicting auxiliary attribute values in higher-arity facts.

Related work
Link prediction on binary relational data Most previous work of learning with knowledge graphs (KGs) focused on binary relations. Among different binary relational learning techniques, embedding based models have received increasing attention in recent years due to their effectiveness and simplicity. The idea there is to represent symbolic entities and relations in a continuous vector space and measure the validity of a fact in that space. This kind of models can be roughly grouped into three categories: translation distance based (Bordes et al., 2013; Wang et al., 2014; Sun et al., 2019), semantic matching based (Trouillon et al., 2016; Balazevic et al., 2019), and neural network based (Dettmers et al., 2018; Schlichtkrull et al., 2018), according to the design of validity scoring functions. We refer readers to (Nickel et al., 2016; Wang et al., 2017; Ji et al., 2021) for thorough reviews of the literature.
Link prediction on n-ary relational data Since binary relations oversimplify the complex nature of the data stored in KGs, a few recent studies have started to explore learning and reasoning with n-ary relational data (n > 2), in particular via embedding based approaches. Most of these studies represent n-ary facts as tuples of predefined relations with corresponding attribute values, and generalize binary relational learning methods to the n-ary case, e.g., m-TransH (Wen et al., 2016) and RAE (Zhang et al., 2018) generalize TransH (Wang et al., 2014), a translation distance based embedding model for binary relations, while n-CP, n-TuckER, and GETD (Liu et al., 2020) generalize 3-way tensor decomposition techniques to the higher-arity case. NaLP (Guan et al., 2019) and RAM (Liu et al., 2021) are slightly different approaches which represent n-ary facts directly as groups of attribute-value pairs and then model relatedness between such attributes and values. In these approaches, however, attributes of an n-ary fact are assumed to be equally important, which is often not the case in reality. Rosso et al. (2020) and Guan et al. (2020) therefore proposed to represent n-ary facts as primary triples coupled with auxiliary attribute-value pairs, which naturally discriminates the importance of different attributes. The overall validity of a fact is then measured by the validity of the primary triple and its compatibility with each attribute-value pair. STARE (Galkin et al., 2020) follows the same representation form of n-ary facts, and generalizes graph convolutional networks to n-ary relational KGs to learn entity and relation embeddings. These embeddings are then fed into a Transformer decoder to score n-ary facts. Nevertheless, during the decoding process STARE takes into account solely global dependencies and ignores the specific n-ary structure of a given fact.
Transformer and its extensions Transformer (Vaswani et al., 2017) was initially devised as an encoder-decoder architecture for machine translation, and quickly received broad attention across all areas of natural language processing (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019). Transformer uses neither convolution nor recurrence, but instead is built entirely with (self-)attention layers.
Recently, there has been a lot of interest in modifying this attention to further meet various desired specifications, e.g., to encode syntax trees (Strubell et al., 2018; Wang et al., 2019c), character-word lattice structures, as well as relative positions between words (Shaw et al., 2018; Wang et al., 2019a). There are also a few recent attempts that apply vanilla Transformer (Wang et al., 2019b) or hierarchical Transformer (Chen et al., 2020) to KGs, but mainly restricted to binary relations and deployed with conventional attention. This work, in contrast, deals with higher-arity relational data represented as heterogeneous graphs, and employs modified attention to encode graph structure and heterogeneity.

Conclusion
This paper studies link prediction on higher-arity relational facts and presents a graph-based approach to this task. For each given n-ary fact, our approach (1) represents the fact as a heterogeneous graph in which the semantics of the fact are fully retained; (2) models the graph using fully-connected attention with edge-aware attentive biases so as to capture both local and global dependencies within the given fact. By modeling an n-ary fact as a whole graph, our approach can more effectively capture entity-relation associations therein, which is crucial for inference on such facts. Link prediction results on a variety of n-ary relational benchmarks demonstrate the significant effectiveness and superiority of our approach.
As future work, we would like to (1) verify the effectiveness of GRAN on newly introduced benchmarks such as WD50K (Galkin et al., 2020) and FB-AUTO (Fatemi et al., 2020); (2) investigate the usefulness of specific modules, e.g., positional embeddings and various forms of attentive biases in GRAN; and (3) integrate other types of data in a KG, e.g., entities' textual descriptions, for better n-ary link prediction.

A Graph attention layers
After the input embedding layer, we employ a stack of L identical graph attention layers to encode the input graph before making final predictions. These graph attention layers generally follow the design of the Transformer encoder (Vaswani et al., 2017): each consists of two sub-layers, i.e., an edge-biased fully-connected attention sub-layer followed by an element-wise feed-forward sub-layer. The attention sub-layer, as illustrated in Section 3.2, relates different vertices of the input graph to update its vertex representations. It computes attention in an edge-biased fully-connected fashion, and thus is able to capture both global and local dependencies within the graph. The feed-forward sub-layer is composed of two linear transformations with a GELU activation (Hendrycks and Gimpel, 2016) in between, applied to each element/vertex separately and identically. We further introduce residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) around each graph attention layer and its sub-layers. To facilitate these residual connections, all the layers and their sub-layers produce outputs of the same dimension d.
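A minimal numpy sketch of one such layer, assuming single-head attention for brevity. The edge bias is simplified here to an additive term on the attention scores; the exact edge-aware parameterization of Section 3.2 may differ, and the weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def graph_attention_layer(X, edge_bias, Wq, Wk, Wv, W1, W2):
    """One graph attention layer: edge-biased fully-connected attention,
    then an element-wise feed-forward sub-layer, each wrapped with a
    residual connection and layer normalization.
    X:         (n, d) vertex representations
    edge_bias: (n, n) attentive bias per vertex pair (encodes graph
               structure and heterogeneity; 0 where no edge exists)
    """
    d = X.shape[-1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d) + edge_bias   # fully-connected + edge bias
    X = layer_norm(X + softmax(scores) @ V)     # residual + layer norm
    X = layer_norm(X + gelu(X @ W1) @ W2)       # feed-forward sub-layer
    return X
```

Note that the attention remains fully connected: every vertex attends to every other, and the bias only modulates (rather than masks) pairwise interactions.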

B Hyperparameter settings
We use the following hyperparameter settings for GRAN: L = 12 graph attention layers, H = 4 attention heads, hidden size d = 256, batch size b = 1024, and learning rate η = 5e−4. These configurations are fixed across all the datasets. Besides, on each dataset we tune several hyperparameters in their respective ranges, e.g., the entity label smoothing rate. We determine the optimal configuration for GRAN-hete by dev MRR of entity prediction on each dataset, and then directly use the same configuration for GRAN-homo and GRAN-complete. Table 6 presents the optimal configuration on each dataset.
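The dataset-independent settings above can be collected into a single configuration, e.g.:

```python
# Fixed GRAN hyperparameters shared across all datasets (Appendix B).
# The per-dataset tuned values (e.g., the entity label smoothing rate)
# are selected by dev MRR and are not reproduced here.
GRAN_CONFIG = {
    "num_layers": 12,       # L: graph attention layers
    "num_heads": 4,         # H: attention heads
    "hidden_size": 256,     # d
    "batch_size": 1024,     # b
    "learning_rate": 5e-4,  # eta
}
```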

C Infrastructure and runtime
We train all the GRAN variants on a single 16GB V100 GPU. With the hyperparameter settings specified in Appendix B, the whole training and evaluation process takes about 3 hours on JF17K, 17 hours on WikiPeople, 10 hours on WikiPeople−, 1 hour on JF17K-3, and 0.5 hours each on JF17K-4, WikiPeople-3, and WikiPeople-4. Compared to previous methods like HINGE (Rosso et al., 2020) and NeuInfer (Guan et al., 2020), which score individual facts and learn from positive-negative pairs, GRAN directly scores each target answer against all candidates in a single pass and thus drastically speeds up evaluation. GRAN is also much more efficient than STARE (Galkin et al., 2020), a graph encoder plus Transformer decoder architecture. By eliminating the computationally heavy graph encoder, GRAN requires significantly less running time while still outperforming STARE, e.g., GRAN-hete achieves .617 MRR within 3 hours on JF17K, while STARE takes about 10 hours to reach .574 MRR; GRAN-hete achieves .503 MRR within 10 hours on WikiPeople−, while STARE takes about 4 days (9-10 times slower) to reach a similar MRR.
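The single-pass scoring mentioned above can be sketched as follows; the dot-product scorer and shapes are illustrative, not GRAN's exact decoding head:

```python
import numpy as np

def score_all_candidates(hidden, entity_embeddings):
    """Single-pass candidate scoring (illustrative): the representation
    of the masked position is matched against the embeddings of *all*
    candidate entities at once, so one forward pass ranks the entire
    entity vocabulary, instead of scoring one negative sample at a time.
    hidden:            (d,)   representation of the masked element
    entity_embeddings: (E, d) one row per candidate entity
    returns:           (E,)   a score per candidate
    """
    return entity_embeddings @ hidden

rng = np.random.default_rng(42)
d, E = 8, 1000
scores = score_all_candidates(rng.normal(size=d), rng.normal(size=(E, d)))
ranking = np.argsort(-scores)  # best candidate first, e.g., for computing MRR
```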