Text-Augmented Open Knowledge Graph Completion via Pre-Trained Language Models

The mission of open knowledge graph (KG) completion is to draw new findings from known facts. Existing works that augment KG completion require either (1) factual triples to enlarge the graph reasoning space or (2) manually designed prompts to extract knowledge from a pre-trained language model (PLM), exhibiting limited performance and requiring expensive effort from experts. To this end, we propose TAGREAL, which automatically generates quality query prompts and retrieves support information from large text corpora to probe knowledge from a PLM for KG completion. The results show that TAGREAL achieves state-of-the-art performance on two benchmark datasets. We find that TAGREAL performs well even with limited training data, outperforming existing embedding-based, graph-based, and PLM-based methods.


Introduction
A knowledge graph (KG) is a heterogeneous graph that encodes factual information in the form of entity-relation-entity triplets, where a relation connects a head entity and a tail entity (e.g., "Miami-located_in-USA") (Wang et al., 2017; Hogan et al., 2021). KGs (Dai et al., 2020) play a central role in many NLP applications, including question answering (Hao et al., 2017; Yasunaga et al., 2021), recommender systems (Zhou et al., 2020), and drug discovery (Zitnik et al., 2018). However, existing works (Wang et al., 2018; Hamilton et al., 2018) show that most large-scale KGs are incomplete and cannot fully cover the massive real-world knowledge. This challenge motivates KG completion, which aims to find one or more object entities given a subject entity and a relation (Lin et al., 2015). For example, in Figure 1, our goal is to predict the object entity with "Detroit" as the subject entity and "contained_by" as the relation.
However, existing KG completion approaches (Trouillon et al., 2016b; Das et al., 2018) have several limitations (Fu et al., 2019). First, their performance heavily depends on the density of the graph: they usually perform well on dense graphs with rich structural information but poorly on the sparse graphs that are more common in real-world applications. Second, previous methods (e.g., Bordes et al. (2013)) assume a closed-world KG without considering the vast open knowledge in external resources. In fact, a KG is often associated with a rich text corpus (Bodenreider, 2004), which contains a vast amount of factual data not yet extracted. To overcome these challenges, we investigate the task of open knowledge graph completion, where the KG can be extended with new facts from outside the KG. Recent text-enriched solutions (Fu et al., 2019) focus on using a predefined set of facts to enrich the knowledge graph. Nonetheless, the predefined set of facts is often noisy and constricted; that is, it does not provide sufficient information to efficiently update the KG.

Figure 1: The quality of hand-crafted prompts can be limited, while prompt mining is a scalable alternative. Support information also helps the PLM understand the purpose of prompts. In this example, Canada and Michigan are both potentially valid options, but given prompt mining and support information retrieval, the model becomes confident about Michigan as the answer.
Pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019a) have been shown to be powerful in capturing factual knowledge implicitly by learning on massive unlabeled texts (Petroni et al., 2019b). Since PLMs are superb at text encoding, they can be utilized to facilitate knowledge graph completion with external text information. Recent knowledge graph completion methods (Shin et al., 2020; Lv et al., 2022) focus on using manually crafted prompts (e.g., "Detroit is located in [MASK]" in Figure 1) to query the PLMs for graph completion (e.g., "Michigan"). However, manually creating prompts can be expensive, with limited quality (e.g., the PLM gives a wrong answer, "Canada", to the query with a handcrafted prompt, as shown in Figure 1).
Building on the above limitations of standard KG completion and the enormous power of PLMs (Devlin et al., 2019; Liu et al., 2019a), we aim to use PLMs for open knowledge graph completion. We propose an end-to-end framework that jointly exploits the implicit knowledge in PLMs and the textual information in the corpus to perform knowledge graph completion (as shown in Figure 1). Unlike existing works (e.g., Fu et al. (2019); Lv et al. (2022)), our method does not require a manually pre-defined set of facts and prompts, making it more general and easier to adapt to real-world applications.
Our contributions can be summarized as follows:
• We study the open KG completion problem, which can be assisted by facts captured from PLMs. To this end, we propose TAGREAL, a new framework for text-augmented open KG completion with real-world knowledge in PLMs.
• We develop prompt generation and support information retrieval methods, which enable TAGREAL to automatically create high-quality prompts for PLM knowledge probing and to search for support information, making it more practical, especially when PLMs lack some domain knowledge.
• Through extensive quantitative and qualitative experiments on real-world knowledge graphs such as Freebase, we show the applicability and advantages of our framework.

Related Work

KG Completion Methods
KG completion methods can be categorized into embedding-based and PLM-based methods.
Embedding-based methods represent entities and relations as embedding vectors and maintain their semantic relations in the vector space. TransE (Bordes et al., 2013) vectorizes the head, relation, and tail of triples into a Euclidean space. DistMult (Yang et al., 2014) converts all relation embeddings into diagonal matrices in bilinear models. RotatE (Sun et al., 2019) represents each relation embedding as a rotation in complex vector space from the head entity to the tail entity.
In recent years, researchers have realized that PLMs can serve as knowledge bases (Petroni et al., 2019a; Zhang et al., 2020; AlKhamissi et al., 2022). PLM-based methods for KG completion (Yao et al., 2019; Kim et al., 2020; Chang et al., 2021; Lv et al., 2022) have started to gain attention. As a pioneer, KG-BERT (Yao et al., 2019) fine-tunes a PLM with the concatenated head, relation, and tail of each triple, outperforming conventional embedding-based methods on link prediction tasks. Lv et al. (2022) present PKGC, which uses manually designed triple prompts and carefully selected support prompts as inputs to the PLM. Their results show that PLMs can substantially improve KG completion performance, especially in the open-world (Shi and Weninger, 2018) setting. Compared to PKGC, our framework TAGREAL automatically generates prompts of higher quality without any domain expert knowledge. Furthermore, instead of presupposing the existence of support information, we search for relevant textual information in the corpus with an information retrieval method to support PLM knowledge probing.

Knowledge Probing using Prompts
LAMA (Petroni et al., 2019a) is the first framework for knowledge probing from PLMs. Its prompts are manually created with a subject placeholder and an unfilled space for the object. For example, a triple query (Miami, location, ?) may have the prompt "Miami is located in [MASK]", where "<subject> is located in [MASK]" is the template for the "location" relation. The training goal is to correctly fill [MASK] with the PLM's prediction. Another work, BertNet (Hao et al., 2022), proposes an approach that applies GPT-3 (Brown et al., 2020) to automatically generate a weighted prompt ensemble from input entity pairs and a manual seed prompt. It then uses the PLM again to search and select top-ranked entity pairs with the ensemble for KG completion.

Prompt Mining Methods
When there are several relations to interpret, manual prompt design is costly due to the required domain expert knowledge. In addition, the prompt quality cannot be ensured. Hence, quality prompt mining has caught the interest of researchers. Jiang et al. (2020) propose MINE, an approach that searches for middle words or dependency paths between the given inputs and outputs in a large text corpus (e.g., Wikipedia). They also propose a reasonable approach to optimize the ensemble of the mined prompts by weighting individual prompts according to their performance on the PLM.
Before the emergence and widespread use of PLMs, textual pattern mining performed a similar function of finding reliable patterns for information extraction. For instance, MetaPAD (Jiang et al., 2017) generates quality meta patterns by context-aware segmentation with a pattern quality function, and TruePIE (Li et al., 2018) proposes the concept of pattern embeddings and a self-training framework that discovers positive patterns automatically.

Methodology
We propose TAGREAL, a PLM-based framework that handles KG completion tasks. In contrast to previous work, our framework does not rely on handcrafted prompts or pre-defined relevant facts. As shown in Figure 2, we automatically create appropriate prompts and search for relevant support information, which are further utilized as templates to probe implicit knowledge from PLMs.

Problem Formulation
Knowledge graph completion aims to add new triples (facts) to the existing triple set of a KG. There are two tasks toward this goal. The first is triple classification, a binary classification task that predicts whether a triple (h, r, t) belongs to the KG, where h, r, and t denote the head entity, relation, and tail entity, respectively. The second task is link prediction, which predicts either the tail entity t given a query (h, r, ?) or the head entity h given a query (?, r, t).

Prompt Generation
Previous studies (e.g., Jiang et al. (2020)) demonstrate that the accuracy of relational knowledge extracted from PLMs heavily relies on the quality of the prompts used for querying. To this end, we develop a comprehensive approach for automatic quality prompt generation given only triples in the KG as input, as shown in Figure 3. We use textual pattern mining methods to mine quality patterns from large corpora as the prompts for PLM knowledge probing. As far as we know, we are pioneers in using textual pattern mining methods for LM prompt mining. We believe in the applicability of this approach for the following reasons.
• Similar data sources.We apply pattern mining on large corpora (e.g., Wikipedia) which are the data sources where most of PLMs are pre-trained.
• Similar objectives.Textual pattern mining is to mine patterns to extract new information from large corpora; prompt mining is to mine prompts to probe implicit knowledge from PLMs.
• Similar performance criteria.The reliability of a pattern or a prompt is indicated by how many accurate facts it can extract from corpora/PLMs.
Sub-corpora mining is the first step, creating the data source for pattern mining. Specifically, given a KG with a relation set R = (r_1, r_2, ..., r_k), we first extract tuples T_{r_i} paired by head entities and tail entities for each relation r_i ∈ R from the KG. For example, for the relation r_1: /business/company/founder, we extract all tuples in this relation, such as <microsoft, bill_gates>, from the KG. For each tuple t_j, we then search for sentences s_{t_j} containing both head and tail in a large corpus (e.g., Wikipedia) and other reliable sources, and add them to compose the sub-corpus C_{r_i}. We limit the size of each set to θ for each tuple to mine more generic patterns for future applications.
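The sub-corpus construction step can be sketched as follows. This is a minimal in-memory illustration (the function name and the toy corpus are ours, not from the paper; the real pipeline scans Wikipedia/NYT10-scale text):

```python
def mine_sub_corpus(tuples, corpus_sentences, theta=500):
    """Collect, per (head, tail) tuple, up to `theta` sentences mentioning
    both entities; their union forms the relation's sub-corpus C_{r_i}."""
    sub_corpus = []
    for head, tail in tuples:
        hits = [s for s in corpus_sentences
                if head in s.lower() and tail in s.lower()]
        sub_corpus.extend(hits[:theta])
    return sub_corpus

# toy corpus standing in for Wikipedia + reliable sources
corpus = [
    "Microsoft is planning a push with a keynote by Bill Gates, the co-founder.",
    "Bill Gates founded Microsoft together with Paul Allen in 1975.",
    "Paris is the capital of France.",
]
sub = mine_sub_corpus([("microsoft", "bill gates")], corpus)
```

Here `sub` keeps only the two sentences mentioning both entities; a production version would use entity linking rather than naive substring matching.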
Phrase segmentation and frequent pattern mining are applied to mine patterns from the sub-corpora as prompt candidates. We use AutoPhrase (Shang et al., 2018) to segment the corpora into natural and unambiguous semantic phrases, and the FP-Growth algorithm (Han et al., 2000) to mine frequently appearing patterns to compose a candidate set. The size of this set is large, as it contains plenty of messy textual patterns.
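A drastically simplified sketch of this candidate-mining step is shown below. MINE-style middle-text extraction stands in for the full AutoPhrase + FP-Growth pipeline, and both function names are illustrative:

```python
from collections import Counter

def middle_pattern(sentence, head, tail):
    """Turn one sentence into a candidate prompt by replacing the head/tail
    mentions with [X]/[Y] and keeping the text between them."""
    low = sentence.lower()
    i, j = low.find(head), low.find(tail)
    if i < 0 or j < 0:
        return None
    if i < j:
        return "[X] " + low[i + len(head):j].strip() + " [Y]"
    return "[Y] " + low[j + len(tail):i].strip() + " [X]"

def candidate_prompts(tuples, sub_corpus):
    """Count how often each candidate pattern occurs across the sub-corpus;
    frequent patterns become prompt candidates."""
    counts = Counter()
    for head, tail in tuples:
        for sent in sub_corpus:
            p = middle_pattern(sent, head, tail)
            if p is not None:
                counts[p] += 1
    return counts
```

For example, `middle_pattern("Bill Gates founded Microsoft.", "microsoft", "bill gates")` yields the candidate `"[Y] founded [X]"`; real frequent-pattern mining would additionally mine sub-patterns and context words around the entity pair.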
Prompt selection. To select quality patterns from the candidate set, we apply two textual pattern mining approaches: MetaPAD (Jiang et al., 2017) and TruePIE (Li et al., 2018). MetaPAD applies a pattern quality function that introduces several criteria of contextual features to estimate the reliability of a pattern. We explain why those features can also be adapted for LM prompt estimation: (1) Frequency and concordance: since a PLM learns more contextual relations between frequent patterns and entities during the pre-training stage, a pattern that occurs more frequently in the background corpus can probe more facts from the PLM. Similarly, if a pattern composed of highly associated sub-patterns appears frequently, it should be considered a good one, as the PLM would be familiar with the contextual relations among the sub-patterns. (2) Informativeness: a pattern with low informativeness (e.g., p′_1 in Figure 3) has a weak ability to probe PLM knowledge, as it cannot interpret the relation between the subject and object entities well. (3) Completeness: the completeness of a pattern strongly affects PLM knowledge probing, especially when one of the placeholders is missing (e.g., p′_{m−2} in Figure 3), in which case the PLM cannot even give an answer. (4) Coverage: a quality pattern should be able to probe as many accurate facts from the PLM as possible. Therefore, patterns like p′_4, which suit only a few or even one case, should have a low quality score. We then apply TruePIE to the prompts (patterns) selected by MetaPAD. TruePIE filters out the prompts that have low cosine similarity with the positive samples (e.g., p′_3 and p′_{m−1} are filtered). This matters for the creation of the prompt ensemble, since we want the prompts in the ensemble to be semantically close to each other so that a few poor-quality prompts do not significantly impact the prediction result of the PLM. As a result, we create a more reliable prompt ensemble P_{r_i} = {p_{i,1}, p_{i,2}, ..., p_{i,n}} based on the averaged probabilities
given by the prompts:

P(y|x, r_i) = (1/n) Σ_{j=1}^{n} P(y|x, p_{i,j}),    (1)

where r_i is the i-th relation and p_{i,j} is the j-th prompt in P_{r_i}. Beyond prompt selection, a prompt optimization process is also employed. As pointed out by Jiang et al. (2020), some prompts in the ensemble are more reliable and ought to be weighted more. Thus, we change Equation 1 to:

P(y|x, r_i) = Σ_{j=1}^{n} w_{i,j} P(y|x, p_{i,j}),    (2)

where w_{i,j} is the weight of the j-th prompt for the i-th relation. In our setting, all weights {w_{1,1}, ..., w_{k,n}} are learned through the PLM to optimize P(y|x, r_i) for r_i ∈ R ahead of the training process.

Figure 4: Support information retrieval with BM25 for an example query <microsoft, /business/company/founder, ?>, where the retrieved sentence "however, microsoft is planning a significant marketing push into the field with a keynote speech by bill_gates, the company 's co-founder and chairman." is prepended, after the [CLS] token, to the prompt.
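The uniform and weighted ensembles differ only in the weights applied to the per-prompt probabilities. A minimal sketch, with toy probability values standing in for the PLM's outputs P(y|x, p_{i,j}):

```python
def ensemble_prob(per_prompt_probs, weights=None):
    """P(y|x, r_i): uniform average over the ensemble, or a weighted
    sum when learned weights w_{i,j} are supplied."""
    n = len(per_prompt_probs)
    if weights is None:
        weights = [1.0 / n] * n  # equal weighting
    return sum(w * p for w, p in zip(weights, per_prompt_probs))

probs = [0.9, 0.6, 0.3]                           # toy P(y|x, p_{i,j}) values
uniform = ensemble_prob(probs)                    # ≈ 0.6
weighted = ensemble_prob(probs, [0.7, 0.2, 0.1])  # ≈ 0.78
```

In TAGREAL the weights are learned ahead of training; here they are fixed toy values to show how a reliable first prompt can dominate the ensemble score.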

Support Information Retrieval
In addition to prompt mining, we attach query-wise and triple-wise support text information to the prompt to help the PLM understand the knowledge we want to probe, as well as to aid in training the triple classification ability. As shown in Figure 4, for the i-th query q_{r_i} in relation r, we use BM25 (Robertson et al., 1995) to retrieve highly ranked support texts with a score greater than δ and a length shorter than ϕ from the reliable corpus, and we randomly select one of them as the support information. To compose the input cloze q̃_{r_i} for the PLM, we concatenate the support text with each prompt in the optimized ensemble obtained through the previous steps, with the subject filled in and the object masked.
[CLS] and [SEP] are the tokens for sequence classification and support information-prompt separation, respectively.
In the training stage, we search texts using triples rather than queries, and the [MASK] is filled by the object entity. It is worth noting that support text is optional in TAGREAL; we leave it blank if no matching data is found.
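The retrieval step can be sketched with a hand-rolled Okapi BM25 scorer over tokenized documents. This is only an illustration of the scoring formula (production systems use tuned implementations, plus the δ score and ϕ length thresholds described above):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

query = "founder microsoft".split()
docs = ["bill gates , founder of microsoft".split(),
        "paris is the capital of france".split(),
        "the founder left the company".split()]
scores = bm25_scores(query, docs)
```

The document mentioning both query terms ranks highest; a document sharing no terms scores zero and would never pass the δ threshold.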

Training
To train our model, we create negative triples in addition to the given positive triples, following the idea introduced by PKGC (Lv et al., 2022), to handle the triple classification task. We create negative triples by replacing the head and tail in each positive triple with the "incorrect" entity that achieves a high probability under the KGE model. We also create random negative samples by randomly replacing the heads and tails to enlarge the set of negative training/validation triples. The labeled training triples are assembled as T = T^+ ∪ (T^−_KGE ∪ T^−_RAND), where T^+ is the positive set, and T^−_KGE and T^−_RAND are the two negative sets created by the embedding-model-based and random approaches, respectively. Then, we transform all training triples of each relation r into sentences with the prompt ensemble P_r and the triple-wise support information retrieved by BM25 (if any). At the training stage, the [MASK] is replaced by the object entity in each positive/negative triple. The query instances q̃_{r_i} are then used to fine-tune the PLM by updating its parameters. Cross-entropy loss (Lv et al., 2022) is applied for optimization:

L = − Σ_{τ ∈ T} ( y_τ log c^1_τ + (1 − y_τ) (1/M) log c^0_τ ),

where c^0_τ, c^1_τ ∈ [0, 1] are the softmax classification scores of the token [CLS] for the triple τ, y_τ is the ground-truth label (1/0) of the triple, and M is the ratio between the number of negative and positive triples. After the PLM is fine-tuned with the positive/negative triples in the training set, it should perform better at classifying the triples in the dataset than a raw PLM. This capability enables it to perform KG completion as well.
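The random part of the negative sampling can be sketched as follows. The KGE-ranked "hard" negatives T^−_KGE are omitted, and `corrupt` is an illustrative helper name, not from the paper:

```python
import random

def corrupt(triple, entities, n_rand, rng):
    """Create random negative triples T^-_RAND by replacing the head or the
    tail of a positive triple with another entity from the entity set."""
    h, r, t = triple
    negatives = set()
    while len(negatives) < n_rand:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if cand != triple:  # never re-create the positive triple
            negatives.add(cand)
    return sorted(negatives)

rng = random.Random(0)
entities = ["microsoft", "bill_gates", "apple", "steve_jobs", "ibm"]
positive = ("microsoft", "founder", "bill_gates")
negatives = corrupt(positive, entities, 3, rng)
```

A fuller version would also filter out corruptions that happen to be true triples elsewhere in the KG.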

Inference
Given a query (h, r, ?), we apply the query-wise support information relevant to the head entity h and relation r, since we presume that we are unaware of the tail entity (our prediction goal). Then, we build the corresponding query instances containing [MASK], with both the support information and the prompt ensemble, as shown in Figure 4. To leverage the triple classification capability of the PLM for link prediction, we replace [MASK] in a query instance with each entity in the known entity set and rank their classification scores in descending order to create a 1-d vector as the prediction result for each query. This means that lower-indexed entities in the vector are more likely to compose a positive triple with the input query. For the prompt ensemble, we sum up the scores by entity index before ranking them. A detailed illustration is placed in Appendix E.
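The fill-and-rank procedure can be sketched as below, with a toy scoring function standing in for the PLM's [CLS] triple-classification score:

```python
def rank_entities(score_fn, prompts, entities):
    """Fill [MASK] with every candidate entity, sum the classification scores
    over the prompt ensemble, and rank entities by total score (descending)."""
    totals = {e: sum(score_fn(p.replace("[MASK]", e)) for p in prompts)
              for e in entities}
    return sorted(entities, key=lambda e: -totals[e])

def toy_score(text):
    # stand-in for the PLM's [CLS] classification score of the filled cloze
    return 1.0 if "michigan" in text else 0.1

prompts = ["detroit is located in [MASK]", "detroit , [MASK]"]
ranking = rank_entities(toy_score, prompts, ["canada", "michigan", "texas"])
```

Summing scores across the ensemble is exactly the per-entity aggregation described above; the lowest-indexed entity in `ranking` is the predicted tail.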

[Table 1: Link prediction results (Hits@5, Hits@10, MRR) on FB60K-NYT10 with 20%/50%/100% of the training data, comparing KGE-based baselines such as TransE (Bordes et al., 2013) with other methods and TAGREAL.]

Datasets

FB60K-NYT10 contains more general relations (e.g., "nationality of person"), whereas UMLS-PubMed focuses on biomedical domain-specific relations (e.g., "gene mapped to disease"). We use the pre-processed datasets (with training/validation/testing split 8:1:1) to align the evaluation of our method with the baselines. Due to the imbalanced distribution and noise present in FB60K-NYT10 and UMLS-PubMed, 16 and 8 relations are selected for the performance evaluation, respectively. We place more details of the datasets in Appendix A.

Experimental Setup
For FB60K-NYT10, we use LUKE (Yamada et al., 2020), a PLM pre-trained on additional Wikipedia data with RoBERTa (Liu et al., 2019b). For UMLS-PubMed, we use SapBERT (Liu et al., 2021), which is pre-trained on both UMLS and PubMed with BERT (Devlin et al., 2019). For sub-corpora mining, we use Wikipedia with 6,458,670 document examples as the general corpus and NYT10/PubMed as the reliable sources, and we mine at most 500 sentences (θ = 500) for each tuple. For prompt selection, we apply MetaPAD with its default setting, and apply TruePIE with the infrequent-pattern penalty and the thresholds for positive and negative patterns reset to {0.5, 0.7, 0.3}, respectively. For support information retrieval, we use BM25 to search relevant texts with δ = 0.9 and ϕ = 100 in the corpora NYT10/PubMed. We follow the same fine-tuning process as PKGC. We use TuckER as the KGE model to create negative triples, and we set M = 30 as the ratio of negative to positive triples. To compare with baselines, we test our model on training sets with ratios of [20%, 50%, 100%] for FB60K-NYT10 and [20%, 40%, 70%, 100%] for UMLS-PubMed. The evaluation metrics are described in Appendix F.
Main Results

TAGREAL outperforms existing works in most cases. Given dense training data, KGE-based methods (e.g., RotatE) and RL-based methods (e.g., CPL) can still achieve relatively high performance. However, when the training data is limited, these approaches suffer, whereas PLM-based methods (PKGC and TAGREAL) are not greatly impacted. Our approach performs noticeably better than the current non-PLM-based ones in such cases. This is because the KGE models cannot be trained effectively with inadequate data, and the RL-based path-finding models cannot recognize the underlying patterns given insufficient evidential and general paths in the KG. On the other hand, PLMs already possess implicit knowledge that can be used directly, and the negative effects of insufficient data are less harsh in fine-tuning than in training from scratch. TAGREAL outperforms PKGC due to its ability to automatically mine quality prompts and retrieve support information, in contrast to manual annotations, which are often limited. Next, we analyze the impacts of support information and prompt generation on the performance of TAGREAL.

Model Analysis
We conduct an ablation study to verify the effectiveness of both the automatically generated prompts and the retrieved support information. The results are presented in Table 3 and Figures 5 and 6.

Support Information. As shown in Table 3, for FB60K-NYT10, support information helps improve Hits@5 and Hits@10 in the ranges of [5.2%, 7.5%] and [3.8%, 5.3%], respectively. For UMLS-PubMed, it helps improve Hits@5 and Hits@10 in the ranges of [1.9%, 4.94%] and [0.9%, 3.6%], respectively. Although the overlap between UMLS and PubMed is higher than that between FB60K and NYT10 (Fu et al., 2019), the textual information in PubMed does not help as much as NYT10 because: (1) SapBERT already possesses adequate implicit knowledge of both UMLS and PubMed, so a large portion of additional support texts might be useless. The lines "u2", "u3", "u4" and "u5" in Figure 5 show that support information helps more when using LUKE as the PLM, as it contains less domain-specific knowledge. This also implies that the support information could generalize to other applications, especially when fine-tuning a PLM is difficult in low-resource scenarios (Arase and Tsujii, 2019; Mahabadi et al., 2021).
(2) UMLS contains more queries with multiple correct answers than FB60K (see Appendix A), which means some queries are likely "misled" to another answer and thus not counted in the Hits@N metric.

Prompt Generation. As shown in Figure 6, almost all of the relations could be converted into better prompts by our prompt mining and optimization, albeit some of them might be marginally worse than manually created prompts, for the following reason: a few of the mined prompts, which are of lower quality than the manually created ones, may significantly hurt the prediction score of an equally weighted ensemble. PLM-based weighting reduces such negative effects of the poor prompts for the optimized ensembles and enables them to outperform most handcrafted prompts. In addition, Table 3 shows the overall improvement for these three types of prompts, demonstrating that for both datasets, optimized ensembles outperform equally weighted ensembles, which in turn outperform manually created prompts. Moreover, by comparing line "f1" with line "f2", or line "u1" with line "u3" in Figure 5, we find a performance gap between the PLM with manual prompts and with the optimized ensemble for triple classification, highlighting the effectiveness of our method.

Case Study
Figure 7 shows an example of using TAGREAL for link prediction with the query (?, /location/location/contains, alba), where "piedmont" is the ground truth. By comparing the prediction results in different pairs, we find that both prompt generation and support information enhance the KG completion performance. With the handcrafted prompt, the PLM simply lists terms that have some connection to the subject entity "alba" without being aware that we are trying to find the place in which it is located. In contrast, with the optimized prompt ensemble, the PLM lists entities that are highly relevant to our target, where "cuneo", "italy", and "northern_italy" are correct real-world answers, indicating that our intention is well conveyed to the PLM. With the support information, the PLM increases the scores of entities related to the keywords ("italy", "piedmont") in the text. Moreover, the optimized ensemble removes "texas" and "scotland" from the list and leaves only Italy-related locations. More examples are placed in Appendix H.

Conclusion and Future Works
In this study, we proposed a novel framework to exploit the implicit knowledge in PLMs for open KG completion. Experimental results show that our method outperforms existing methods, especially when the training data is limited. We showed that the prompts optimized with our approach outperform handcrafted ones in PLM knowledge probing. The effectiveness of support information retrieval in aiding the prompting is also demonstrated. In the future, we may leverage the power of QA models to retrieve more reliable support information. Another potential extension is to make our model more explainable by exploring path-finding tasks.

Limitations
Due to the nature of deep learning, our method is less explainable than path-finding-based KG completion methods (e.g., CPL), which provide a concrete reasoning path to the target entity.Composing the path with multiple queries might be an applicable strategy that is worthwhile to investigate in order to extend our work on the KG reasoning task.
For the link prediction task, we adapt the "recall and re-ranking" strategy from PKGC (Lv et al., 2022), which brings a trade-off between prediction efficiency and accuracy.We alleviate the issue by applying different hyper-parameters given different sizes of training data, which is discussed in detail in Appendix C.
As a common issue of existing KG completion models, the performance of our model also degrades when the input KG contains noisy data.The advantage of our approach in addressing this issue is that it can use both corpus-based textual information and implicit PLM knowledge to reduce noise.

Ethical Statements
In this study, we use two datasets, FB60K-NYT10 and UMLS-PubMed, which include the knowledge graphs FB60K and UMLS as well as the text corpora NYT10 and PubMed. All of the data is publicly available. Our task is knowledge graph completion, which is performed by finding missing facts given existing knowledge. This work is relevant only to NLP research, and we do not foresee improper use of it.

A Dataset Overview
We use the datasets FB60K-NYT10 and UMLS-PubMed provided by Fu et al. (2019). They take the following steps to split the data: (1) split the data of each KG (FB60K or UMLS) in the ratio of 8:1:1 for training/validation/testing. (2) For the training data, they keep all triples in all relations.
(3) For the validation/testing data, they only keep the triples in the 16/8 relations of concern (see the relations in Table 5).

Query-triple ratio. Within the relations that we focus on, we calculate the ratio of triples to queries (including both (h, r, ?) and (?, r, t)) to indicate the average number of correct answers a query may have. The result is given in Table 4. For UMLS-PubMed, as the relations are symmetric in pairs, the numbers of queries for head and tail prediction are the same. Table 5 presents the counts in a more detailed setting. Both tables show that there are more multi-answer queries in UMLS-PubMed than in FB60K-NYT10, which explains why the support information may not be as helpful in the former as it is in the latter, as revealed by Table 3 and discussed in Section 5.2.
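The query-triple ratio can be computed as in this minimal sketch over an in-memory triple list (the function name is ours):

```python
from collections import defaultdict

def query_triple_ratio(triples):
    """Average answers per query: each triple answers one (h, r, ?) query and
    one (?, r, t) query, so divide 2*|triples| by the number of distinct queries."""
    tail_q = defaultdict(set)   # (h, r) -> answer tails
    head_q = defaultdict(set)   # (r, t) -> answer heads
    for h, r, t in triples:
        tail_q[(h, r)].add(t)
        head_q[(r, t)].add(h)
    return 2 * len(triples) / (len(tail_q) + len(head_q))

triples = [("a", "r", "b"), ("a", "r", "c"), ("d", "r", "b")]
ratio = query_triple_ratio(triples)  # 6 answers over 4 distinct queries
```

A ratio above 1 means multi-answer queries exist, which is the property that makes Hits@N undercount correct predictions.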

B Textual Pattern Mining
The purpose of pattern mining is to find rules that describe particular patterns in the data. Information extraction is a common goal for pattern mining and prompt mining, where the former focuses on extracting facts from massive text corpora and the latter on extracting facts from PLMs. In this section, we use another example (Figure 8) to explain in detail how textual pattern mining approaches like MetaPAD (Jiang et al., 2017) and TruePIE (Li et al., 2018) are implemented to mine quality prompts. In the example, given the relation location/neighborhood/neighborhood_of as the input, we first extract tuples (e.g., <east new york, brooklyn>) in the relation from the KG (i.e., FB60K). Then, we construct a sub-corpus by searching the sentences in a large corpus (e.g., Wikipedia) and the KG-related corpus (i.e., NYT10 for FB60K). After the creation of the sub-corpus, we apply phrase segmentation and frequent pattern mining to mine raw prompt candidates. Since the candidate set is noisy, with some prompts of low completeness (e.g., "in lower [Y]"), low informativeness (e.g., "the [Y], [X]"), and low coverage (e.g., "[X], manhattan, [Y]") present, we use MetaPAD to handle the prompt filtering with its quality function, which introduces those contextual features. After the prompts have been processed by MetaPAD, we choose one of them to serve as a seed prompt (for example, "[X] neighborhood of [Y]") so that other prompts can be compared to it by computing their cosine similarity. As the positive seed prompt is selected manually, there is still room for future improvement.
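The seed-prompt comparison described above can be sketched as follows. TruePIE learns pattern embeddings; a bag-of-words cosine stands in here purely for illustration, and the function names are ours:

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two prompt strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_by_seed(candidates, seed, threshold=0.5):
    """Keep candidates similar enough to the positive seed prompt."""
    return [p for p in candidates if cosine(p, seed) >= threshold]

seed = "[X] is a city located in [Y]"
candidates = ["[X] is located in [Y]", "[X] , a town in [Y]", "the [X] [Y]"]
kept = filter_by_seed(candidates, seed)
```

Here the uninformative pattern `"the [X] [Y]"` falls below the threshold and is filtered out, while the two location-style prompts survive.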

C Re-ranking Recalls from KGE Model
Re-ranking framework.
According to the inference process presented in Figure 9, we fill the placeholder ([MASK]) with each entity (e_1, e_2, ..., e_n) in the entity set E. However, as noted by Lv et al. (2022), the inference speed of PLM-based models is much slower than that of KGE models, which is a disadvantage of using PLMs for KG completion. To address this issue, they use the recalls from KGE models: they run KG completion with a KGE model and select the X top-ranked entities for each query as the entity set E, then shuffle the set and re-rank those entities with the PLM-based model. In our work, we adapt this re-ranking framework to accelerate inference and evaluation, as our time complexity is Z times that of PKGC (Lv et al., 2022) for each case, where Z is the size of the prompt ensemble. We use the recalls from TuckER (Balažević et al., 2019).

Limitations. Nonetheless, implementing the re-ranking framework involves a trade-off between efficiency and Hits@N performance. When the training data is large (e.g., 100%), the KGE model can be well trained, so the ground truth entity e_gt is more likely to be contained among the top X ranked entities. However, when the training data is limited (e.g., 20%), the trained KGE model does not perform well on link prediction, as shown in Tables 1 and 2.
In such a case, there is a chance that e_gt is not among the top X entities if we keep the same X regardless of the size of the training data. To alleviate this side effect, we test and select different values of the hyper-parameter X for different sizes of training data, as presented in Table 6.

To check how much room there is for improvement, we manually add the ground truth entity to the recalls (we should not do this for the evaluation of TAGREAL, as we assume the object entity is unknown) and test the performance of TAGREAL on UMLS-PubMed. The result is shown in Table 7. Comparing this with Table 3 for UMLS-PubMed, we find that changing the value of X cannot fully address the issue. We leave this improvement as one of our major future works.

Table 7: Link prediction of TAGREAL on UMLS-PubMed with the ground truth added to the KGE recalls. Data in brackets are Hits@5 (left) and Hits@10 (right).
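The re-ranking pipeline described in this appendix can be sketched as below. This is a minimal illustration, not the released implementation: the `kge_score` and `plm_score` callables are hypothetical stand-ins for TuckER scoring and PLM triple classification, and the toy scores are invented.

```python
def rerank(query, entities, kge_score, plm_score, top_x):
    """Recall the top-X entities with a fast KGE scorer, then re-rank
    only those candidates with the slower PLM scorer."""
    recalls = sorted(entities, key=lambda e: kge_score(query, e), reverse=True)[:top_x]
    return sorted(recalls, key=lambda e: plm_score(query, e), reverse=True)

# Toy scorers standing in for TuckER and the fine-tuned PLM.
kge = {"michigan": 0.7, "canada": 0.8, "ohio": 0.3, "usa": 0.6}
plm = {"michigan": 0.95, "canada": 0.4, "ohio": 0.2, "usa": 0.5}
query = ("detroit", "contained_by", "?")
ranking = rerank(query, list(kge), lambda q, e: kge[e], lambda q, e: plm[e], top_x=3)
print(ranking)  # ohio is pruned by the KGE recall and never reaches the PLM
```

The trade-off discussed above is visible here: if the ground truth were "ohio", a too-small X (or a poorly trained KGE model) would drop it before the PLM ever scores it.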

D Computing Infrastructure & Budget
We trained and evaluated TAGREAL on 7 NVIDIA RTX A6000 GPUs running in parallel, as we support multi-GPU computing. Training TAGREAL to good performance took about 22 and 14 hours on the entire FB60K-NYT10 dataset (with LUKE (Yamada et al., 2020)) and the entire UMLS-PubMed dataset (with SapBert (Liu et al., 2021)), respectively. The training time is proportional to the size (ratio) of the training data. The evaluation took about 12 minutes for FB60K-NYT10 with LUKE when hyper-parameter X = 20, and 16 minutes for UMLS-PubMed with SapBert when X = 30.
The evaluation time is proportional to X, which explains why we applied the re-ranking framework (Appendix C) to improve prediction efficiency. For link prediction with an equally-weighted or optimized ensemble, we apply the method shown in Figure 9. Specifically, for each sentence with [MASK] filled by an entity e_i, we calculate its classification score with the fine-tuned PLM. For each query, we obtain an m × n matrix, where m is the number of prompts in the ensemble and n is the number of entities in the entity set (which is X if the re-ranking framework is applied). For an equally weighted ensemble, we simply sum the scores each entity obtains from the different prompts, whereas for an optimized ensemble, we multiply the prompt weights by the scores before summation. After sorting the resulting 1 × n vector in descending order, we obtain the ranking of entities as the result of link prediction.
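The ensemble scoring just described can be written out concisely. This is our sketch of the aggregation step only (the PLM scoring that fills the m × n matrix is omitted); the score values are invented for illustration.

```python
def ensemble_rank(scores, weights=None):
    """scores: m x n nested list, where scores[i][j] is the classification
    score of entity j under prompt i. Returns entity indices ranked by
    (weighted) total score in descending order."""
    m, n = len(scores), len(scores[0])
    if weights is None:
        weights = [1.0] * m  # equally weighted ensemble
    totals = [sum(weights[i] * scores[i][j] for i in range(m)) for j in range(n)]
    return sorted(range(n), key=lambda j: totals[j], reverse=True)

scores = [[0.2, 0.9, 0.5],   # prompt p_1
          [0.4, 0.7, 0.6]]   # prompt p_2
print(ensemble_rank(scores))                      # equally weighted
print(ensemble_rank(scores, weights=[0.2, 0.8]))  # optimized weights
```

With the re-ranking framework of Appendix C, n equals the recall size X, so this aggregation runs over only the top-X candidates.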

F Evaluation Metrics
Following previous KG completion works (Fu et al., 2019; Lv et al., 2022), we use Hits@N and Mean Reciprocal Rank (MRR) as our evaluation metrics. As mentioned in Section 3.5, the prediction for each query (h, r, ?) is a 1-d vector of entity indices in descending order of their scores. Specifically, for a query q_i, we record the rank of the object entity t as R_i; then we have:

MRR = (1/Q) Σ_{i=1}^{Q} 1/R_i,    Hits@N = (1/Q) Σ_{i=1}^{Q} 1(R_i ≤ N),

where Q is the number of queries in evaluation. To exploit the power of the PLM, we need to map the codes (entity_ids) in the KG/corpus into words (Figure 10 shows the performance difference of the PLM between using words and using codes). For FB60K-NYT10, we use the mapping provided by JointNRE (Han et al., 2018)5, which covers the translation of all entities. For UMLS-PubMed, we jointly use three mappings6,7,8 which cover 97.22% of all entities.
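The two metrics above are direct to compute once the rank R_i of each ground-truth entity is known; the rank values below are invented for illustration.

```python
def mrr(ranks):
    """Mean Reciprocal Rank over the recorded ranks R_i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of queries whose ground-truth entity ranks within the top N."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 2, 10]  # rank of the object entity for each of Q = 4 queries
print(round(mrr(ranks), 3))  # (1 + 1/3 + 1/2 + 1/10) / 4 ≈ 0.483
print(hits_at_n(ranks, 5))   # 3 of 4 ranks are <= 5 -> 0.75
```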

H Case Study
In addition to Figure 7, we show more examples of applying TAGREAL to link prediction in Figure 11. We can see that predictions with the optimized prompt ensemble outperform those with manual prompts in all cases, and even outperform predictions with manual prompts plus support information in some cases. In all these examples, the support information aids PLM knowledge probing in different ways. For the first example, we believe the PLM captures the words "brother james_murray" and "his wife jenny" and realizes that we are talking about the Scottish lexicographer "james_murray" rather than the American comedian with the same name, based on our survey. For the second example, the PLM probably captures "glycemic control", which is highly relevant to the disease "hyperglycemia". For the third example, the term "antiemetic" (a drug against vomiting) is likely captured, so that the answer "vomiting" can be correctly predicted. Hence, the support information need not include the object entity itself; including only some text relevant to it can also be helpful.

I Re-evaluation of Knowledge Graph Embedding Models
We find that the performance of some KGE models was underestimated by Fu et al. (2019) due to the low embedding dimensions set for entities and relations. According to our re-evaluation (Table 8), many of these models perform much better with a higher dimension, and we report their best performance in Tables 1 and 2 based on our experiments. For the previously evaluated models, we use the same code9,10,11 as Fu et al. used to ensure the fairness of the comparison. For TuckER (Balažević et al., 2019), we use the code provided by the authors.12 As in Fu et al., to make the comparison more rigorous, we do not apply the filtered setting (Bordes et al., 2013; Sun et al., 2019) of the Hits@N evaluation to any of the models, including TAGREAL.
Detroit is the largest city in the U.S. state of Michigan. People from Detroit, [MASK].

Figure 2 :
Figure 2: TAGREAL framework. The input and output of each phase are highlighted in red and green, respectively. The dotted arrow indicates an optional process.

(Figure 3 graphic: example relations such as /business/company/founder and /location/neighbor/neighbor_of.)

Figure 3 :
Figure 3: Prompt generation process. The solid lines connect the intermediate processes, and the arrows point to the intermediate/final results. Input and output are highlighted in red and green, respectively. [X] and [Y] denote the head and tail entities, respectively.

Figure 6 :
Figure 6: Relation-wise KG completion performance (Hits@10) comparison on FB60K-NYT10. Labels on the x-axis are abbreviations of relations.

Figure 7 :
Figure 7: Example of link prediction with TAGREAL on FB60K-NYT10. Man denotes manual prompt. Optim denotes optimized prompt ensemble. Supp denotes support information. The ground truth tail entity ,

Figure 10 :
Figure 10: Performance variation of triple classification w.r.t. training time. "code" and "word" denote the representations of KG entities.

Table 1 :
Performance comparison of KG completion on the FB60K-NYT10 dataset. Results are averaged over ten independent runs of head/tail entity predictions. The highest score is highlighted in bold.

Table 2 :
Performance comparison of KG completion on the UMLS-PubMed dataset. Results are averaged over ten independent runs of head/tail entity predictions. The highest score is highlighted in bold.

Table 4 :
The number of queries and the triple-to-query ratio for FB60K-NYT10 and UMLS-PubMed.

Table 5 :
Number of triples (#triples) and queries (#queries) per relation for FB60K-NYT10 and UMLS-PubMed. Triples/queries for both head prediction and tail prediction are counted. "all" and "test" denote the whole dataset and the testing data, respectively.

Table 6 :
Best X for different training sizes.

Figure 9 :
Figure 9: Prediction with ensemble. e_1, e_2, ..., e_n denote the indices of entities. p_1, p_2, ..., p_m denote the prompts in the ensemble.