Friendly Neighbors: Contextualized Sequence-to-Sequence Link Prediction

We propose KGT5-context, a simple sequence-to-sequence model for link prediction (LP) in knowledge graphs (KG). Our work expands on KGT5, a recent LP model that exploits textual features of the KG, has small model size, and is scalable. To reach good predictive performance, however, KGT5 relies on an ensemble with a knowledge graph embedding model, which itself is excessively large and costly to use. In this short paper, we show empirically that adding contextual information — i.e., information about the direct neighborhood of the query entity — alleviates the need for a separate KGE model to obtain good performance. The resulting KGT5-context model is simple, reduces model size significantly, and obtains state-of-the-art performance in our experimental study.


Introduction
A knowledge graph (KG) is a collection of facts describing relations between real-world entities. Facts are represented as subject-relation-object (s, r, o) triples such as (Brendan Fraser, hasWonPrize, Oscar). In this paper, we study the link prediction (LP) problem, which is to infer missing links in the KG. We focus on KGs in which the entities and relations have textual features (such as mentions or descriptions). Saxena et al. (2022) made a case for using language models (LMs) for this task. They proposed the KGT5 model, which posed the link prediction problem as a sequence-to-sequence (seq2seq) task. The main advantages of this approach are that (i) it allows for small model sizes and (ii) it decouples inference cost from the graph size. They found that KGT5's performance was particularly strong when predicting the object of new relations for a query entity (e.g., the birthplace of a person), but fell short of alternative approaches when predicting additional objects for a known relation (e.g., additional awards won by someone).
To avoid this problem, Saxena et al. (2022) used an ensemble of KGT5 with a large knowledge graph embedding (KGE) model (ComplEx (Trouillon et al., 2016)). This ensemble did reach good performance, but it gave up both advantages (i) and (ii) of using an LM. In fact, KGE models learn a low-dimensional representation of each entity and each relation in the graph (Bordes et al., 2013; Sun et al., 2019a; Trouillon et al., 2016). Consequently, model size and LP cost are linear in the number of entities in the graph, which can be expensive for large-scale KGs. For example, the currently best-performing model (Cattaneo et al., 2022) for a large-scale benchmark consists of an ensemble of 85 KGE models, each taking up more than 86 GB of space for its parameters. Though KGE model sizes can be reduced by using compositional embeddings based on text mentions (Wang et al., 2021; Clouatre et al., 2021; Wang et al., 2022), inference cost remains excessively high for large graphs.
We propose and study KGT5-context, which expands on KGT5 by providing contextual information about the query entity, i.e., information about its direct neighborhood, to facilitate link prediction. Our work is motivated by the KGE model HittER (Chen et al., 2021), which follows a similar approach; we use the seq2seq model KGT5 instead of a Transformer-based KGE model. KGT5-context is very simple: the only change to KGT5 is that we add a verbalization of the neighborhood of the query entity to the description of a given LP task; see Fig. 1 for an example. KGT5-context retains advantages (i) and (ii) of KGT5.
We performed an experimental study using the Wikidata5M (Wang et al., 2021) and WikiKG90Mv2 (Hu et al., 2021) benchmarks. We found that, without further hyperparameter tuning, KGT5-context reached or exceeded state-of-the-art performance on both benchmarks using a significantly smaller model size than alternative approaches. The simple KGT5-context model thus provides a suitable baseline for further research.

Figure 1: Overview of KGT5-context (bottom) and comparison to KGT5 (top); real example from Wikidata5M, best viewed in color. KGT5-context differs from KGT5 in that it appends the neighboring relations and entities of Yambao (a drama movie) to the verbalized query. Both models then apply T5, sample predictions from the decoder, map the samples to entities, and rank by sample logit scores.

Expanding KGT5 with Context
Given a query (s, r, ?) and a KG, LP is the task of predicting new answer entities, i.e., of filling the ? slot of the query. An example is given in Fig. 1.
KGT5 (Saxena et al., 2022) treats link prediction as a seq2seq task. It exploits available textual information, such as canonical mention names of entities and relations. KGT5's architecture is based on the encoder-decoder Transformer model T5 (Raffel et al., 2020). It uses canonical mentions to verbalize the LP query to a text sequence of the form "predict tail: <subject mention> | <relation mention> | ". To predict answers, KGT5 samples (exact) candidate mentions from the decoder; the cost of sampling is independent of the number of entities in the KG. To train KGT5, Saxena et al. (2022) use standard training techniques for LMs: KGT5 is trained on facts in the KG and asked to generate the true answer (using teacher forcing and a cross-entropy loss).
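The sampling-based prediction step can be sketched as follows. This is a minimal illustration using the Hugging Face generation API, not the released KGT5 code; the mention2entity lookup table is an assumed helper that maps a generated mention string back to an entity identifier.

```python
def predict_entities(model, tokenizer, verbalized_query, mention2entity,
                     num_samples=500, device="cuda"):
    """Sample candidate mentions from the decoder and rank matching entities.

    Sketch only: model/tokenizer are Hugging Face T5 objects; samples that do
    not exactly match a known entity mention are discarded, and the remaining
    entities are ranked by the log-probability of their best sample.
    """
    inputs = tokenizer(verbalized_query, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=num_samples,
        max_new_tokens=64,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Token-level log-probabilities of the sampled tokens.
    token_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, normalize_logits=True
    )
    # Sum log-probabilities per sequence, ignoring padding positions.
    gen_tokens = outputs.sequences[:, 1:]  # drop the decoder start token
    mask = gen_tokens != model.config.pad_token_id
    seq_scores = (token_scores * mask).sum(dim=-1)

    ranked = {}
    for seq, score in zip(outputs.sequences, seq_scores.tolist()):
        mention = tokenizer.decode(seq, skip_special_tokens=True)
        entity = mention2entity.get(mention)  # exact-match lookup
        if entity is not None:
            ranked[entity] = max(ranked.get(entity, float("-inf")), score)
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```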
KGT5-context (ours) proceeds in the same way as KGT5 but extends the verbalization of the query. In particular, we append a textual sequence of the one-hop neighborhood of the query entity s to the verbalized query of KGT5. As a result, the query entity is contextualized, an approach that has been applied successfully to KGE models before (Chen et al., 2021). KGT5-context simplifies the prediction problem because additional information that is readily available in the KG is provided along with the query. In the example of Fig. 1, the contextual information states that Yambao is a Mexican movie; this information already rules out the two top predictions of KGT5 (which suggest that Yambao is a piece of music). For a more detailed analysis, see Sec. 3.3.
Verbalization details. To summarize, we obtain mentions of the entities and relations in the query as well as in the one-hop neighborhood of the query entity. We use these mentions to verbalize the query together with the neighborhood as "query: <query entity> | <query relation> | context: <context 1 relation> | <context 1 entity> <SEP> ...". To preserve the direction of relations, we prepend the relation mention with "reverse of" if the query entity acts as the object (i.e., the relation "points towards" the query entity). We randomly shuffle the ordering of the neighborhood before verbalization. We also limit the neighborhood size (default: k = 100); if the query entity has a larger neighborhood, we sample relation-neighbor pairs uniformly at random. A real-world example is given in Fig. 1.
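A minimal sketch of this verbalization is shown below; the mention lookup dictionaries and the neighborhood data structure are illustrative assumptions, not taken from the released code.

```python
import random

def verbalize(query_entity, query_relation, neighborhood,
              entity_mention, relation_mention, max_neighbors=100):
    """Build the KGT5-context input for a query (s, r, ?).

    `neighborhood` is a list of (relation, neighbor, is_inverse) tuples, where
    is_inverse indicates that the query entity is the object of the edge
    (the relation "points towards" the query entity).
    """
    # Limit the context: sample relation-neighbor pairs uniformly at random.
    if len(neighborhood) > max_neighbors:
        neighborhood = random.sample(neighborhood, max_neighbors)
    else:
        neighborhood = list(neighborhood)
    random.shuffle(neighborhood)  # random ordering before verbalization

    query_part = (f"query: {entity_mention[query_entity]} | "
                  f"{relation_mention[query_relation]} | context:")
    context_parts = []
    for rel, neighbor, is_inverse in neighborhood:
        rel_text = relation_mention[rel]
        if is_inverse:
            rel_text = "reverse of " + rel_text  # preserve relation direction
        context_parts.append(f"{rel_text} | {entity_mention[neighbor]}")
    return query_part + " " + " <SEP> ".join(context_parts)
```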

Experimental Study
We conducted an experimental study to investigate (i) to what extent integrating context in terms of the entity neighborhood into KGT5 improves link prediction performance, (ii) whether the use of context can remove the need for an ensemble of the text-based KGT5 model with a KGE model, and (iii) for what kind of queries context is helpful. We found that:
1. KGT5-context improved the state-of-the-art performance on Wikidata5M using a smaller model (Tab. 1).
2. KGT5-context was orders of magnitude smaller than the leading models on WikiKG90Mv2 and reached competitive performance (Tab. 2).¹
3. KGT5-context did not benefit further from ensembling with a KGE model (Tab. 3).

Experimental Setup
Source code, configuration, and models will be made publicly available.
Datasets. We evaluate KGT5-context on two commonly used large-scale link prediction benchmarks. Wikidata5M (Wang et al., 2021) is the graph induced by the 5M most frequent entities of the Wikidata KG. WikiKG90Mv2 (Hu et al., 2021) contains more than 90M entities and over 600M facts. In contrast to Wikidata5M, it is only evaluated on tail prediction, i.e., on (s, r, ?) queries. Dataset statistics are summarized in Tab. 4 in Sec. A.
Metrics. We follow the standard procedure to evaluate model quality for the link prediction task. In particular, for each test triple (s, r, o), we rank all candidate answers to the query (s, r, ?) (and to (?, r, o) on Wikidata5M) by their predicted scores. For KGT5 and KGT5-context, we instead sample from the decoder and ignore outputs that do not correspond to an existing entity mention. For all models, we filter out all true answers other than the test triple that occur in the train, validation, or test data. Finally, we determine the mean reciprocal rank (MRR) and Hits@K over all test triples. In case of ties, we use the mean rank to avoid misleading results (Sun et al., 2019b).
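For illustration, the filtered ranking protocol with mean-rank tie handling can be sketched as follows. The sketch assumes a score for every candidate entity; for sampling-based models such as KGT5-context, entities never produced by the decoder would implicitly receive a score of minus infinity.

```python
import numpy as np

def filtered_rank(scores, true_answer, other_answers):
    """Filtered rank of the true answer, using the mean rank over ties.

    `scores` maps every candidate entity to its predicted score;
    `other_answers` are all other true answers (from train, valid, or test)
    that are filtered out before ranking.
    """
    target = scores[true_answer]
    candidates = np.array([s for e, s in scores.items()
                           if e == true_answer or e not in other_answers])
    better = np.sum(candidates > target)
    ties = np.sum(candidates == target)  # includes the true answer itself
    return better + (ties + 1) / 2.0

def evaluate(test_cases, k=10):
    """Compute filtered MRR and Hits@k over all test queries."""
    ranks = np.array([filtered_rank(c["scores"], c["answer"], c["filter"])
                      for c in test_cases])
    return {"MRR": float(np.mean(1.0 / ranks)),
            f"Hits@{k}": float(np.mean(ranks <= k))}
```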
Settings. We mainly follow the setting of KGT5. For all experiments, we used the same T5 architecture (T5-small for Wikidata5M, T5-base for WikiKG90Mv2) without any pretrained weights. Training from scratch ensures that the test data is truly unseen during (pre-)training and avoids leakage. We used the SentencePiece tokenizer pretrained by Raffel et al. (2020). We trained on 8 A100 GPUs with a per-GPU batch size of 32 (effective batch size of 256) using the AdaFactor optimizer. No dataset-specific hyperparameter optimization was performed, and models were trained until the MRR on the validation set did not improve for 5 epochs. For KGT5-context, we sampled up to 100 neighbors per query entity or up to an input sequence length of 512 tokens. For inference, we obtained 500 samples from the decoder.
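A sketch of this setup using the transformers library is shown below; the hyperparameters follow the description above, while the remaining details (variable names, data handling) are illustrative assumptions rather than the authors' exact implementation.

```python
from transformers import T5Config, T5ForConditionalGeneration, T5Tokenizer
from transformers.optimization import Adafactor

# Randomly initialized T5-small (no pretrained weights), so the test data
# cannot have been seen during any (pre-)training.
config = T5Config.from_pretrained("t5-small")
model = T5ForConditionalGeneration(config).cuda()

# Pretrained SentencePiece tokenizer of Raffel et al. (2020).
tokenizer = T5Tokenizer.from_pretrained("t5-small")

optimizer = Adafactor(model.parameters(), lr=None,
                      relative_step=True, warmup_init=True)

def training_step(batch):
    """One teacher-forced step: generate the answer mention from the
    verbalized, contextualized query under a cross-entropy loss."""
    inputs = tokenizer(batch["source"], padding=True, truncation=True,
                       max_length=512, return_tensors="pt").to("cuda")
    labels = tokenizer(batch["target"], padding=True,
                       return_tensors="pt").input_ids.to("cuda")
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```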

Link Prediction Performance
Link prediction performance on Wikidata5M is shown in Tab. 1; additional baselines are given in Tab. 5 (appendix). We found that KGT5-context outperformed traditional KGE models by up to 7pp in terms of MRR, with a model size reduction of 90-98%. Likewise, KGT5-context improved on KGT5 by 7pp, on the KGT5+ComplEx ensemble by 4pp, and on the current state-of-the-art model SimKGC by 2pp.
The results on the largest benchmark WikiKG90Mv2 are shown in Tab. 2. Here, KGT5-context was multiple orders of magnitude smaller than the currently best-performing models¹ and improved validation MRR by almost 1pp.²

¹ The parameter count in Tab. 2 corresponds to the size of the largest model in an ensemble, not the overall model size. For example, BESS consists of 85 models, and the complete ensemble has 2.6T parameters; the KGT5-context model is 5 orders of magnitude smaller.

² We directly used mentions of entities and relations for WikiKG90Mv2 instead of the textual embeddings used by other models. For this reason, the benchmark authors (Hu et al., 2021) did not provide us with scores on the hidden test set. The mentions used to be provided with the dataset but have since been removed; we obtained them from https://github.com/apoorvumang/kgt5.

Analysis
To investigate in which cases context information was beneficial, we empirically analyzed LP performance w.r.t. (i) query frequency and (ii) the degree of the query entity. We also sampled predictions and summarize our general observations.
Query frequency. The frequency of a test query (s, r, ?) is the number of answers to the query already available in the training data. For example, queries for N:1 relations have frequency 0, whereas queries for 1:N relations can have large frequencies for high-degree query entities. We bucketized the test queries of Wikidata5M into low-, medium-, and high-frequency queries and report the average MRR for various models in Tab. 3. Generally, high-frequency queries appear harder to answer: these queries have many known true answers already (tying up model capacity), and additional answers might be largely unrelated. In contrast, a low-frequency query such as (Brendan Fraser, instanceOf, ?) has few or no known answers and might be easier to infer, even when the combination of this particular subject and relation was not yet seen during training.
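The bucketization can be sketched as follows; the exact bucket boundaries shown are assumptions for illustration (the analysis only explicitly mentions a 1-10 bucket).

```python
from collections import Counter

def bucketize_queries(train_triples, test_queries,
                      buckets=((0, 0), (1, 10), (11, float("inf")))):
    """Assign each test query (s, r, ?) to a frequency bucket, where the
    frequency is the number of answers already present in the training data."""
    freq = Counter((s, r) for s, r, _ in train_triples)
    assignment = {}
    for s, r in test_queries:
        f = freq[(s, r)]  # 0 if the pair never occurs in training
        for lo, hi in buckets:
            if lo <= f <= hi:
                assignment[(s, r)] = (lo, hi)
                break
    return assignment
```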
In general, KGT5 performed well on queries that did not occur in the training data but was outperformed by a large margin by ComplEx on queries seen multiple times. Hence, both models complement each other in an ensemble. KGT5-context strongly improved performance over KGT5 for low-frequency queries, however, even outperforming ComplEx (in the 1-10 bucket). For this reason, an ensemble of KGT5-context and ComplEx only brought negligible benefits (and substantial drawbacks): KGT5-context does not need to be ensembled with a KGE model to obtain good results.

Entity degree. We also investigated the benefit of contextual information w.r.t. the degree of the query entity; see Fig. 2 in Sec. A. We found that KGT5-context was beneficial and performed well for query entities with a degree of up to 100. For entities with a very large degree (i.e., nodes with more than 100 or even 1000 neighbors), ComplEx showed benefits. As before, we consider these performance benefits negligible given the large model size of ComplEx.
Anecdotal results. When we studied the predictions of KGT5-context, we found that context is especially beneficial when (i) the entity mention alone provides only limited information about the entity, and when (ii) the answer to the query is contained in the 1-hop neighborhood.
A case of (i) is the real example shown in Fig. 1. Here, KGT5 was able to capture the geographic region of the real-world entity only based on its mention. Based on this geographic notion, it proposed the music genre latin pop but was unaware that the entity is a movie. This useful information can be obtained directly from the 1-hop neighborhood and is exploited by KGT5-context. For Wikidata5M, the correct answer entity appears in the 1-hop neighborhood of the query entity for about 7% of the validation triples. But even when the answer does not appear directly in the context, the context may contain entities that strongly hint at the correct answer. For example, it is easier to predict that an entity has occupation biochemist when the context already contains the information that the entity is a chemist.

Conclusion
We proposed and studied KGT5-context, a sequence-to-sequence model for link prediction in knowledge graphs. KGT5-context extends the prior KGT5 model by using contextual information of the query entity for prediction. KGT5-context is simple and small, and it obtained state-of-the-art performance in our experimental study. It thus provides a suitable baseline for further research in this area. A natural direction, for example, is to explore approaches that integrate context information in a less naive way than KGT5-context.

Limitations
KGT5-context relies on the textual mentions of entities and relations. Therefore, it is only applicable to KGs with text features. That being said, KGT5-context may still handle individual entities without textual features as long as they are well described by their neighborhood.
The verbalized neighborhood of the query entity leads to long input sequences. These long sequences lead to higher memory consumption and longer epoch times during training. Overall, training cost is typically higher than for traditional KGE models, which can be tuned and trained efficiently (Lerer et al., 2019; Kochsiek and Gemulla, 2021; Zheng et al., 2020).
Finally, to use KGT5-context in practice, the KG has to be queried in order to obtain context information.