Relation Specific Transformations for Open World Knowledge Graph Completion

We propose an open-world knowledge graph completion model that can be combined with common closed-world approaches (such as ComplEx) and enhance them to exploit text-based representations for entities unseen in training. Our model learns relation-specific transformation functions from text-based to graph-based embedding space, where the closed-world link prediction model can be applied. We demonstrate state-of-the-art results on common open-world benchmarks and show that our approach benefits from relation-specific transformation functions (RST), giving substantial improvements over a relation-agnostic approach.


Introduction
Knowledge graphs are an interesting source of information that can be exploited by retrieval (Dong et al., 2014) and question answering systems (Ferrucci et al., 2010). They are, however, known to be inherently sparse (Paulheim, 2017). To overcome this problem, knowledge graph completion (KGC) enriches graphs with new triples. While most existing approaches require all entities to be part of the training graph, for many applications it is of interest to infer knowledge about entities not present in the graph, i.e., open-world entities. Here, approaches usually assume that some text describing the target entity is given, from which an entity representation can be inferred, for example via text embeddings (Mikolov et al., 2013; Devlin et al., 2018). To the best of our knowledge, only a few such open-world KGC approaches have been proposed so far (Xie et al., 2016; Shi and Weninger, 2017a; Shah et al., 2019).
We suggest a simple yet effective approach towards open-world KGC. Similar to the OWE model of Shah et al. (2019), our approach enables existing KGC models to perform open-world prediction: Given an open-world entity, its name and description are aggregated into a text-based entity representation, and a transformation from the text-based embedding space to the graph-based embedding space is learned, where the closed-world KGC model can be applied. However, while OWE's transformation only takes the open-world entity into account, our approach also utilizes the target triple's relation, such that specific mappings are learned for different relations such as birthplace, spouse, or located in (see Figure 1). We demonstrate that this extension comes with strong improvements, yielding state-of-the-art results on common open-world datasets.

Related Work
Interest in KGC has increased recently, with most of the work focusing on embedding-based approaches. Earlier approaches (Nickel et al., 2016) have recently been complemented by other models such as DistMult (Yang et al., 2014), TransR (Lin et al., 2015), ComplEx (Trouillon et al., 2016), ProjE (Shi and Weninger, 2017b) and RotatE (Sun et al., 2019). The above models estimate the probability of triples (head, rel, tail) using a scoring function φ(u_head, u_rel, u_tail), where u_x denotes the embedding of entity/relation x and is a real-valued or complex-valued vector. φ depends on the model and ranges from simple translation (Bordes et al., 2013) over bilinear forms (Yang et al., 2014) to complex-valued forms (Trouillon et al., 2016). Training happens by learning to discriminate real triples from perturbed ones, typically by negative sampling (Nickel et al., 2016).
Figure 1: Our approach first trains a KGC model on the graph without using textual information (bottom left). For every annotated entity, we extract a text-based embedding v by aggregating the word embeddings for tokens in the entity's name and description (top left). A transformation Ψ is trained to map v to the space of graph-based embeddings (center). The learned mapping can then be applied to unknown entities, thus allowing the trained KGC model to be applied (right).
While the knowledge graph completion models described above leverage only the structure of the graph, some approaches combine text with graph information, typically using embeddings that represent terms, sentences or documents (Goldberg, 2016). Embeddings are usually derived from language models, either in static form (Mikolov et al., 2013) or by contextualized models (Devlin et al., 2018). KGC models can employ such textual information for entities scarcely linked in the graph (Gesese et al., 2019). Most approaches combine a textual embedding with structural KGC approaches, either by initializing structural embeddings from text (Socher et al., 2013; Wang and Li, 2016) or by interpolating between textual and structural embeddings (Xu et al., 2017), sometimes with a joint loss (Toutanova and Chen, 2015) and gating mechanisms (Kristiadi et al., 2019). Others fine-tune for KGC based on textual labels of the entities and relations (Yao et al., 2019).
Only a few other works have addressed open-world KGC so far. Xie et al. (2016) proposed DKRL, which jointly trains graph-based embeddings (TransE) and text-based embeddings while regularizing both types of embeddings to be aligned using an additional loss. ConMask (Shi and Weninger, 2017a) is a text-centric approach where text-based embeddings for entities and relations are derived by an attention model over names and descriptions. Closest to our work is OWE (Shah et al., 2019), which trains graph and text embeddings independently and then learns a mapping between the two embedding spaces (more details are provided in Section 3). While OWE's mapping only takes entities into account, our extended model's mapping is learned given both the entity and the relation when predicting a triple. Orthogonal to our work, WOWE (Zhou et al., 2020) extends the OWE approach by replacing the averaging aggregator with a weighted attention mechanism and can be combined with our approach.

Approach
Given a knowledge graph G ⊂ E×R×E containing triples (h, r, t) (E and R denote finite sets of entities and relations), KGC models can perform tail prediction as follows: Given a pair of head and relation (h, r), the tail is estimated as
$$t^* = \arg\max_{t \in E} \phi(u_h, u_r, u_t) \tag{1}$$
where u_h, u_r, u_t are entity/relation embeddings and φ is a model-specific scoring function. Note that this approach, and our extension, can be applied to head prediction accordingly.
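As an illustration of the tail prediction step, the following is a minimal sketch that assumes ComplEx as the scoring function φ, i.e. the real part of the trilinear product of the head, the relation and the conjugated tail. The function names, the toy embeddings and their dimensions are hypothetical and not taken from any released implementation.

```python
import numpy as np

def complex_score(u_h, u_r, u_t):
    """ComplEx score of one triple: Re(sum_i u_h[i] * u_r[i] * conj(u_t[i]))."""
    return float(np.real(np.sum(u_h * u_r * np.conj(u_t))))

def predict_tail(u_h, u_r, entity_embeddings):
    """Score every candidate tail and return the index of the best one."""
    # (u_h * u_r) has shape (d,); conj(E).T has shape (d, n_entities).
    scores = np.real((u_h * u_r) @ np.conj(entity_embeddings).T)
    return int(np.argmax(scores)), scores

# Toy data (assumption): 5 entities with 4-dimensional complex embeddings.
rng = np.random.default_rng(0)
entities = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))
u_r = rng.normal(size=4) + 1j * rng.normal(size=4)

best, scores = predict_tail(entities[0], u_r, entities)
```

Scoring all candidates with one matrix product is only a convenience of this sketch; any other scoring function φ could be substituted without changing the argmax step.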
We address an open-world setting, where the triple's head is not a part of the knowledge graph, i.e. h ∉ E. However, h is assumed to come with a textual description. Our approach (Figure 1) transforms this text into a token sequence W_h = (w_1, w_2, ..., w_n), from which a sequence of embeddings (v_{w_1}, v_{w_2}, ..., v_{w_n}) is derived using a textual embedding model pre-trained on a large text corpus. We experimented with BERT (Devlin et al., 2018) but did not achieve major improvements over static embeddings, likely because descriptions in KGC datasets tend to be short. Instead we use Wikipedia2Vec (Yamada et al., 2016), which contains phrase embeddings for entity names like "Sheila Walsh". If no phrase embedding is available, we use token-wise embeddings instead. If no embedding is available for a token, we use a vector of zeros as an "unknown" token. The resulting sequence of embeddings is aggregated by average pooling to obtain a single text-based embedding vector of the head entity, v_h ∈ ℝ^d.
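The aggregation described above might be sketched as follows. The two dictionaries are hypothetical stand-ins for phrase and token lookups in a pre-trained embedding model (this is not the Wikipedia2Vec API), and whitespace tokenization is an assumption made for brevity.

```python
import numpy as np

def embed_text(name, description, phrase_vecs, token_vecs, dim):
    """Average-pool embeddings of an entity's name and description."""
    vectors = []
    if name in phrase_vecs:
        # A phrase embedding exists for the full entity name.
        vectors.append(phrase_vecs[name])
        tokens = description.split()
    else:
        # Fall back to token-wise embeddings for the name.
        tokens = name.split() + description.split()
    zero = np.zeros(dim)  # "unknown" token: vector of zeros
    vectors.extend(token_vecs.get(tok, zero) for tok in tokens)
    if not vectors:
        return zero
    return np.mean(vectors, axis=0)

# Toy lookups (assumption): 2-dimensional vectors.
phrase_vecs = {"Sheila Walsh": np.array([1.0, 0.0])}
token_vecs = {"singer": np.array([0.0, 1.0])}

v_h = embed_text("Sheila Walsh", "singer unknownword", phrase_vecs, token_vecs, dim=2)
```

Here the unknown token contributes a zero vector to the average, so it lowers the norm of v_h without changing its direction.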
The key step of our approach is to learn transformation functions Ψ_r that align the text-based and graph-based embedding spaces such that Ψ_r(v_h) ≈ u_h. When applying this mapping, the open-world entity's text-based embedding v_h is transformed into a graph-based proxy embedding Ψ_r(v_h). Triples with h are scored by applying the KGC model from Equation 1 to Ψ_r(v_h):
$$t^* = \arg\max_{t \in E} \phi(\Psi_r(v_h), u_r, u_t) \tag{2}$$
Like OWE (Shah et al., 2019), we use an affine transformation Ψ_r(v) = A_r·v + b_r. The focus of this paper is relation specificity, for which we propose the following two strategies:
Relation Specific Transformation (RST) While the OWE model consists of a global transformation function (Ψ), our proposed RST approach trains a separate transformation function per relation (Ψ_r). Our hypothesis is that when predicting a tail (h, r, ?), including information on the relation r may be beneficial. Consider Table 1, where the transformation Ψ_birthPlace maps v_{Sheila Walsh} to a completely different region in the graph embedding space than Ψ_recordLabel. Therefore, we learn a separate transformation Ψ_r for each relation r ∈ R, containing a separate learnable matrix A_r and vector b_r. For a fair comparison with OWE, we use the ComplEx KGC model (Trouillon et al., 2016) in our experiments and use separate parameters for the real and imaginary parts.
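A relation-specific affine mapping with separate parameters for the real and imaginary parts could look like the sketch below. The class name, the random initialization and the dimensions are assumptions chosen for illustration; in practice the matrices and bias vectors would be learned.

```python
import numpy as np

class RelationSpecificTransform:
    """One affine map per relation, with separate real/imaginary parameters."""

    def __init__(self, relations, text_dim, graph_dim, seed=0):
        rng = np.random.default_rng(seed)
        # params[r][part] = (A, b); random init stands in for learned values.
        self.params = {
            r: {
                part: (rng.normal(scale=0.1, size=(graph_dim, text_dim)),
                       np.zeros(graph_dim))
                for part in ("re", "im")
            }
            for r in relations
        }

    def __call__(self, relation, v):
        """Map a text embedding v to a complex-valued proxy graph embedding."""
        A_re, b_re = self.params[relation]["re"]
        A_im, b_im = self.params[relation]["im"]
        return (A_re @ v + b_re) + 1j * (A_im @ v + b_im)

psi = RelationSpecificTransform(["birthPlace", "recordLabel"], text_dim=3, graph_dim=2)
v = np.array([1.0, 0.5, -0.5])
proxy_birth = psi("birthPlace", v)
proxy_label = psi("recordLabel", v)
```

The same text embedding yields a different proxy embedding for each relation, which is the property the relation-specific approach relies on; a relation-agnostic variant would hold a single parameter set shared by all relations.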
Relation Clustering Transformation (RCT) Our second approach, Relation Clustering Transformation (RCT), aggregates relations into clusters and then learns a separate transformation Ψ_C for each cluster C. We use an agglomerative clustering approach (pseudo-code in Figure 2), which first initializes each cluster with a single relation r ∈ R and conducts η fusion steps, each joining "similar" clusters: Let t(C) denote the set of tails attached to any relation r in C. Clusters C, C′ are fused if the number of shared tails, divided by the size of the smaller cluster, exceeds a threshold S:
$$\frac{|t(C) \cap t(C')|}{\min(|t(C)|, |t(C')|)} > S$$
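The clustering procedure can be approximated by the sketch below. It is not the pseudo-code from Figure 2: the order in which clusters are compared within a fusion step, the greedy merge into the first matching cluster, and the reading of "size of the smaller cluster" as the size of the smaller tail set are all assumptions, as are the function names and the toy data.

```python
def tails_of(cluster, tails_by_relation):
    """t(C): union of the tails attached to any relation in the cluster."""
    return set().union(*(tails_by_relation[r] for r in cluster))

def overlap(c1, c2, tails_by_relation):
    """Shared tails divided by the size of the smaller tail set."""
    t1, t2 = tails_of(c1, tails_by_relation), tails_of(c2, tails_by_relation)
    smaller = min(len(t1), len(t2))
    return len(t1 & t2) / smaller if smaller else 0.0

def cluster_relations(tails_by_relation, eta, threshold):
    """Run at most eta fusion steps, starting from singleton clusters."""
    clusters = [frozenset([r]) for r in sorted(tails_by_relation)]
    for _ in range(eta):
        merged = []
        for cluster in clusters:
            for i, other in enumerate(merged):
                if overlap(cluster, other, tails_by_relation) > threshold:
                    merged[i] = other | cluster  # fuse similar clusters
                    break
            else:
                merged.append(cluster)
        if len(merged) == len(clusters):
            break  # nothing was fused in this step
        clusters = merged
    return clusters

# Toy relations and their tail sets (assumption).
tails = {
    "birthPlace": {"Berlin", "Paris", "Rome"},
    "deathPlace": {"Berlin", "Paris"},
    "recordLabel": {"EMI", "Sony"},
}
clusters = cluster_relations(tails, eta=2, threshold=0.5)
```

In this toy example birthPlace and deathPlace share most of their tails and end up in one cluster, while recordLabel keeps a cluster, and hence a transformation, of its own.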

Training
Our text embeddings v and graph embeddings u are based on pretrained models, such that the only parameters Θ to be learned are the relation matrices A_r and vectors b_r. First, a KGC model is trained on the full graph G, obtaining graph-based entity embeddings u_1, ..., u_n. We then choose the subset E_t of all entities in the graph with textual descriptions, and define G_t := G ∩ (E_t×R×E) as all triples containing heads with text. We then minimize the following loss:
$$L(\Theta) = \sum_{(h, r, t) \in G_t} \mathrm{dist}(\Psi_r(v_h), u_h)$$
with dist() referring either to Euclidean or cosine distance between the graph-based and text-based head embeddings. As we use ComplEx, where embeddings u_h contain real and imaginary parts, the above loss is summed over both parts. Since the number of entities in the datasets used is limited and overfitting is expected to be an issue, we fine-tune neither the graph embeddings nor the text embeddings.
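To make the training objective concrete, the sketch below evaluates the alignment loss on a toy set of triples, summing the distance over the real and imaginary parts. The data layout (dictionaries keyed by entity and relation), the function names and the toy values are assumptions, and the optimizer step that would update A_r and b_r is omitted.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    # Assumes both vectors are non-zero.
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_loss(triples, text_emb, graph_emb, params, dist=euclidean):
    """Sum of dist(Psi_r(v_h), u_h) over triples, for real and imaginary parts."""
    total = 0.0
    for h, r, _t in triples:
        v_h, u_h = text_emb[h], graph_emb[h]
        for part, target in (("re", u_h.real), ("im", u_h.imag)):
            A, b = params[r][part]
            total += dist(A @ v_h + b, target)
    return total

# Toy setup (assumption): one head entity, one relation, 2 dimensions.
text_emb = {"e1": np.array([1.0, 2.0])}
graph_emb = {"e1": np.array([1.0 + 1.0j, 2.0 + 2.0j])}
triples = [("e1", "r", "e2")]

identity = {"r": {"re": (np.eye(2), np.zeros(2)), "im": (np.eye(2), np.zeros(2))}}
shifted = {"r": {"re": (np.eye(2), np.array([3.0, 4.0])), "im": (np.eye(2), np.zeros(2))}}

loss_identity = alignment_loss(triples, text_emb, graph_emb, identity)
loss_shifted = alignment_loss(triples, text_emb, graph_emb, shifted)
loss_cosine = alignment_loss(triples, text_emb, graph_emb, identity, dist=cosine)
```

With the identity mapping the proxy embedding coincides with the graph embedding and the loss vanishes; shifting the real-part bias by (3, 4) moves the proxy by a Euclidean distance of 5.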

Evaluation
We evaluate our model on FB20k (Xie et al., 2016), DBPedia50k (Shi and Weninger, 2017a) and FB15k-237-OWE (Shah et al., 2019) and compare it with the state of the art on the task of open-world tail prediction. Results are shown in Table 2. Due to the lack of an open-world validation set on FB20k, we randomly remove 10% of the test triples and use them as a validation set. Hyperparameters were optimized using a grid search (details in the appendix). We use the same evaluation criteria as Shah et al. (2019), and evaluate our results only on ComplEx to provide a fair comparison with the OWE models. For training the closed-world KGC models, we utilize OpenKE (Han et al., 2018). Additionally, we apply the target filtering approach (Shi and Weninger, 2017a) to all reported results. The Target Filtering Baseline is evaluated by assigning random scores to all targets that pass the target filtering criterion.
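The following sketch shows how a rank-based metric such as Hits@k might be computed once a target filter has been applied, and how a random-score baseline fits the same interface. The filter is passed in as a boolean mask because the filtering criterion itself is not restated here; the function names and toy values are assumptions.

```python
import numpy as np

def filtered_rank(scores, true_tail, allowed):
    """Rank of the true tail among candidates that pass the filter (1 = best)."""
    masked = np.where(allowed, scores, -np.inf)
    return int(np.sum(masked > masked[true_tail])) + 1

def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

# Toy example (assumption): 4 candidate tails, candidate 1 is filtered out.
scores = np.array([0.1, 0.9, 0.5, 0.7])
allowed = np.array([True, False, True, True])
rank = filtered_rank(scores, true_tail=2, allowed=allowed)

# Random baseline: random scores for the same candidates and the same filter.
rng = np.random.default_rng(0)
baseline_rank = filtered_rank(rng.random(len(allowed)), true_tail=2, allowed=allowed)

hits = hits_at_k([1, 2, 11], k=10)
```

Because filtered-out candidates receive a score of negative infinity, they can never outrank the true tail, so the baseline rank is bounded by the number of candidates that pass the filter.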

Results
We observe that ComplEx-RST outperforms all other approaches, including OWE, on all metrics except Hits@10 on DBPedia50k. ComplEx-RCT (relation clusters) performs competitively with ComplEx-RST (one mapping per relation), while the number of transformation functions is reduced from 351 to 279 for DBPedia50k, from 235 to 114 for FB15k-237-OWE and from 1341 to 522 for FB20k. The values of η and S were optimized to achieve the greatest reduction in the number of clusters at a negligible negative impact on accuracy. We also note that the improvement achieved by utilizing the relation information is higher on DBPedia50k and FB20k, both of which use very long descriptions compared to FB15k-237-OWE. We believe that this is because longer descriptions often contain more pieces of information relevant to the relation, which the relation-specific transformations are able to extract and utilize. Finally, in Figure 2 (right) we investigate which relations benefit most from relation-specific mappings: Each point represents a relation in FB15k-237-OWE. We observe that points to the left (rare relations) tend to benefit more from learning a transformation of their own (RST). Those scarce relations seem to be underrepresented in the training data and, accordingly, in the global mapping.

Conclusion
We have proposed a simple approach to incorporate relation-specific information into open-world knowledge graph completion. Our approach achieves state-of-the-art results on common open-world benchmarks and offers strong improvements over relation-agnostic state-of-the-art methods. An interesting direction for future work will be to adapt our model to longer textual inputs, e.g. by using attention to enable the model to select relevant passages (similar to ConMask (Shi and Weninger, 2017a)).