Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering

Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks, based on the internal knowledge stored in their parameters during pre-training. However, such internalized knowledge might be insufficient or incorrect, which could lead LLMs to generate factually wrong answers. Furthermore, fine-tuning LLMs to update their knowledge is expensive. To this end, we propose to augment the knowledge directly in the input of LLMs. Specifically, we first retrieve the facts relevant to the input question from the knowledge graph, based on semantic similarities between the question and its associated facts. After that, we prepend the retrieved facts to the input question in the form of a prompt, which is then forwarded to LLMs to generate the answer. Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training and is thus completely zero-shot. We validate the performance of our KAPING framework on the knowledge graph question answering task, which aims to answer the user's question based on facts over a knowledge graph, on which ours outperforms relevant zero-shot baselines by up to 48% on average, across multiple LLMs of various sizes.


Introduction
Pre-trained Language Models (LMs) (Devlin et al., 2019; Raffel et al., 2020), which are trained on large text corpora with self-supervised learning, can perform closed-book Question Answering (QA) tasks that aim to answer the user's question based only on the internal knowledge in their parameters, without using any external knowledge (Petroni et al., 2019; Roberts et al., 2020). Also, when we increase the LM size, Large Language Models (LLMs) can generate the answer for the question without any additional fine-tuning. In our framework, facts associated with entities in the KG are verbalized (i.e., the symbolic relational knowledge is transformed into a textual string) and prepended to the input question, which is then forwarded to LLMs to generate the answer. Consequently, LLMs conditioned on the factual knowledge are able to generate factual answers, alleviating the hallucination issue, while keeping the LLMs' parameters unchanged: fine-tuning is not required for knowledge updates. We refer to our overall framework as Knowledge-Augmented language model PromptING (KAPING), which is completely zero-shot and can be used with any off-the-shelf LLM, without additional training.
While the above scheme is simple yet effective, there are a couple of challenges. First, most retrieved triples associated with the question entities are unrelated to answering the given question. For example, when we retrieve the triples associated with the question entity (e.g., Poseidon) in Figure 1 from the Wikidata KG (Vrandecic and Krötzsch, 2014), there are 60 triples, and most of them (e.g., genre, publication date, to name a few) are irrelevant to answering the question. Therefore, they might mislead the model into generating incorrect answers. On the other hand, the number of triples for the question entities is occasionally large (e.g., 27% of samples in the WebQSP dataset (Yih et al., 2016) have more than 1,000 triples); thereby, encoding all triples, including unnecessary ones, incurs high computational costs, especially for LLMs.
To overcome such challenges, we further propose to filter out unnecessary triples based on their semantic similarities to the input question, inspired by information retrieval (Bast et al., 2016). To be specific, we first represent the question and its associated verbalized triples in the embedding space. Then, we retrieve the small number of triples whose embeddings are closer to the input question's embedding than others. By doing so, we can prepend only the more relevant triples to the given question, which can effectively prevent LLMs from generating irrelevant answers with high computational efficiency, unlike the scheme that augments all triples. Note that our filtering approach uses off-the-shelf sentence embedding models (Song et al., 2020; Hofstätter et al., 2021); thus, no part of our pipeline requires additional training.
We then validate our KAPING framework on Knowledge Graph Question Answering (KGQA) tasks. The results show that our KAPING significantly outperforms relevant zero-shot baselines. Also, the detailed analyses support the importance of our knowledge retrieval and augmentation schemes.
Our contributions in this work are threefold:
• We present a new knowledge-augmented LM prompting framework that leverages the factual knowledge from KGs, for zero-shot QA.
• We propose to retrieve and augment relevant facts from KGs, based on semantic similarities between the question and its associated triples.
• We validate our KAPING on KGQA benchmark datasets, on which ours impressively outperforms relevant zero-shot baselines.

Related Work
Language Model Prompting Language model pre-training, which trains Transformers (Vaswani et al., 2017) on unannotated text corpora with auto-encoding (Devlin et al., 2019; Liu et al., 2019) or auto-regressive (Yang et al., 2019; Radford et al., 2018) objectives, has become an essential approach for natural language tasks. Also, Large Language Models (LLMs) (Brown et al., 2020; Raffel et al., 2020; Chowdhery et al., 2022; Soltan et al., 2022) are able to perform zero-shot learning, for example, generating the answer for the input textual prompt, based on the knowledge stored in pre-trained parameters (Petroni et al., 2019; Roberts et al., 2020; Sung et al., 2021), without additional parameter updates or labeled datasets. To further improve their performance, some works (Rubin et al., 2022; Liu et al., 2022a) propose retrieving samples relevant to the input question from the training dataset and prepending them in the prompt under few-shot learning. A few recent works (Sanh et al., 2022; Wei et al., 2022a) further show that, when LLMs are fine-tuned on a collection of instructions phrased from natural language tasks, they can achieve strong generalization performance on unseen zero-shot tasks. However, the knowledge inside LMs might be insufficient to tackle factual questions, which gives rise to knowledge-augmented LMs. Notably, our LM prompting is different from the prompt-tuning literature (Lester et al., 2021a; Chen et al., 2022a) that additionally tunes LMs with model training (See Appendix C for discussions).
Knowledge-Augmented LMs Recent work proposes to integrate the knowledge, such as documents from unstructured corpora (e.g., Wikipedia) and facts from Knowledge Graphs (KGs), into LMs.
To mention a few, REALM (Guu et al., 2020) and RAG (Lewis et al., 2020) learn to retrieve documents and augment LMs with them. In addition, KGs could be another knowledge source, where the knowledge is succinctly encoded in the most compact form, and some methods augment such facts in KGs into LMs (Galetzka et al., 2021; Rony et al., 2022; Kang et al., 2022). However, all the aforementioned approaches require a massive amount of training data and model updates for downstream tasks. While more recent work (Izacard et al., 2022) shows that a retrieval-augmented LM can achieve strong performance with few-shot learning, it still requires extra training steps, which is different from ours focusing on LM prompting for an entirely zero-shot setting. Recently, there are a few studies augmenting the knowledge in the LM prompting scheme. First, some works propose to extract the knowledge in the parameters of LLMs themselves via prompting, and then use the extracted knowledge to answer the question (Kojima et al., 2022; Liu et al., 2022b; Wei et al., 2022b; Wang et al., 2022). However, since LLMs' parameters might be insufficient to store all the world knowledge, the extracted knowledge and generated answers might be inaccurate. On the other hand, most recently, Lazaridou et al. (2022) propose to use Google Search to retrieve documents from the Web, and then prepend the retrieved documents to the input question along with few-shot demonstrations, to answer the question under few-shot LLM prompting schemes. However, our focus on zero-shot prompting with KGs is orthogonal to the previous study working on documents with few-shot prompting, and leveraging KGs can bring additional advantages. Specifically, since KGs can succinctly encode the knowledge in the compact triple form, for QA tasks, ours makes LLM prompting more efficient (i.e., reducing the input sequence length compared to the document case), as well as more effective in the zero-shot QA scheme: LLMs need to select one triple containing the answer entity in the prompt, instead of looking through lengthy documents containing various entities.

Knowledge Graph Question Answering
The goal of our target Knowledge Graph Question Answering (KGQA) tasks is to answer the input question based on a set of facts over KGs (Chakraborty et al., 2019; Fu et al., 2020). Previous approaches are broadly classified into neural semantic parsing-based methods (Yih et al., 2015; Bao et al., 2016; Luo et al., 2018), information retrieval-based methods (Sun et al., 2018; Saxena et al., 2020; Yasunaga et al., 2021), and differentiable KG-based methods (Cohen et al., 2020; Saffari et al., 2021; Sen et al., 2021), which, however, require annotated data and additional model training. While Zhou et al. (2021) aim to transfer a KGQA model to target language domains without any training data on them, that work still needs labeled data to train the model on data-rich source domains before transferring it to the target domains. In contrast to all the aforementioned methods, we explore a novel zero-shot KGQA mechanism, which does not require any annotated QA pairs or additional training, leveraging LM prompting.

Method
We now describe our Knowledge-Augmented language model PromptING (KAPING) framework.

LM Prompting for Zero-Shot QA
We begin with zero-shot question answering, and then explain language model prompting.
Zero-Shot Question Answering Given an input question x, the Question Answering (QA) system returns an answer y, where x and y consist of sequences of tokens: x = [w_1, w_2, ..., w_{|x|}]. Let P be a QA model based on a generative Language Model (LM) (Raffel et al., 2020; Brown et al., 2020), which generates the conditional probability of answer y for question x as follows: P(y|x). Then, in contrast to supervised learning that trains model P with a set of annotated (x, y) pairs, zero-shot learning does not use any labeled samples or model training. Notably, we are interested in this zero-shot QA, since collecting a dataset and then fine-tuning existing LMs for every new domain is known to be expensive and sometimes infeasible (Houlsby et al., 2019; Lester et al., 2021b).
LM Prompting LMs are often pre-trained by predicting the next token based on previous tokens, which is known as auto-regressive language modeling (Radford et al., 2018; Raffel et al., 2020). Then, thanks to this pre-training objective, LLMs can perform zero-shot instruction learning. Specifically, when we provide a question as well as an instruction (e.g., "Please answer the question: Who is the author of Lady Susan?") to the LLM (i.e., P), the LLM, conditioned on the input text, can sequentially generate the probability of output tokens, which might be an answer, "Jane Austen".
To be more formal, for every input question x, we first modify it with a particular instruction template T into a textual string x′ called a prompt, as follows: T : x → x′. For example, if we have the previous question x = "Who is the author of Lady Susan?" along with the previous instruction template "Please answer the question:", the resulting prompt x′ would be T(x) = "Please answer the question: Who is the author of Lady Susan?". Then, we forward the prompt x′ to the LLM (i.e., P), which then generates the answer (i.e., y) through P(y|x′). Note that this LM prompting scheme does not require any additional model parameter updates (i.e., fine-tuning) on labeled data, and is thus appropriate for the target zero-shot QA task.
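The template T above amounts to a simple string transformation. A minimal sketch in Python, where the function name `make_prompt` is our own (only the instruction string comes from the running example in the text):

```python
def make_prompt(question: str,
                instruction: str = "Please answer the question:") -> str:
    """Instruction template T: turn a raw question x into a prompt x'."""
    return f"{instruction} {question}"

prompt = make_prompt("Who is the author of Lady Susan?")
# prompt == "Please answer the question: Who is the author of Lady Susan?"
```

The prompt string is then fed as-is to the LLM, with no parameter updates.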
However, there are multiple challenges in this naive zero-shot prompting for QA. First, LLMs, which rely on the knowledge in their parameters, are prone to generating factually incorrect answers, since the knowledge in LLMs might be inaccurate and outdated: new knowledge emerges and existing knowledge changes over time. Also, refining the internalized knowledge with additional parameter updates is expensive, even though it is necessary to correct wrong knowledge and reflect ever-growing knowledge. Lastly, it is unclear which knowledge LLMs memorize and utilize when generating the answer to the question prompt, which limits the explainability of their outputs.

Knowledge-Augmented LM Prompting
In order to tackle the aforementioned limitations of the existing LM prompting scheme, we propose to inject the knowledge relevant to the input question from the Knowledge Graph (KG), which we refer to as Knowledge-Augmented language model PromptING (KAPING). In this subsection, we first define the main objective of our KAPING framework, and then introduce the ingredients for augmenting the knowledge from KGs into LM prompts.

LM Prompting with Knowledge Graphs
Instead of relying on the knowledge internalized in parameters, we propose to additionally access and inject the knowledge from an external KG, which contains accurate and up-to-date facts helpful for answering the question. Formally, a knowledge graph G consists of a set of factual triples {(s, r, o)}, where s and o denote subject and object entities, and r is a specific type of relation between them. For example, the relational knowledge "Lady Susan was written by Jane Austen" can be represented as a triple consisting of two entities s = "Lady Susan" and o = "Jane Austen" along with a relation r = "written by". Then, for the question prompt x′ transformed from the example question x = "Who is the author of Lady Susan?" via the template T, we additionally augment its relevant triple: (Lady Susan, written by, Jane Austen), to the LM prompting scheme. By doing so, LLMs can generate the correct answer with regard to the augmented knowledge from KGs, formalized as follows: P(y|x′, G). Note that, since we can provide specific and valid facts in KGs to LLMs whenever they exist, our framework can alleviate the hallucination issue, which originates from inaccurate and outdated knowledge in LLMs, without costly updates of their model parameters. Furthermore, we can confirm whether LLMs generate answers based on the augmented facts, thus improving the explainability of LM prompting.
The remaining questions are then how to access the relational symbolic facts over the KG from the input question, verbalize the symbolic knowledge into a textual string, and inject the verbalized knowledge into the LM prompting scheme. We explain them one by one in the following paragraphs.

Knowledge Access
In order to utilize the facts related to the input question, we first extract the entities in the question. For example, for the question "Who is the author of Lady Susan?", we extract the entity "Lady Susan". Then, based on the extracted entity, we find its corresponding entity over the KG, whose incident triples then become the facts associated with the input question. Note that entity matching can be done by existing entity linking techniques (Wu et al., 2020; Li et al., 2020; Ayoola et al., 2022).
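The knowledge-access step above can be sketched as follows, assuming entity linking has already produced the question entity. The toy KG, represented as a list of (s, r, o) tuples, and the function name `incident_triples` are our own illustration:

```python
# Toy KG: a list of (subject, relation, object) triples.
KG = [
    ("Lady Susan", "written by", "Jane Austen"),
    ("Lady Susan", "genre", "epistolary novel"),
    ("Jane Austen", "place of birth", "Steventon"),
]

def incident_triples(entity, kg):
    """All triples incident to the question entity (as subject or object)."""
    return [(s, r, o) for (s, r, o) in kg if s == entity or o == entity]

facts = incident_triples("Lady Susan", KG)
# facts contains the two "Lady Susan" triples above
```

These incident triples form the candidate pool that the later retrieval step filters down to the top-K.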
Knowledge Verbalization LLMs work on textual inputs, whereas factual triples are represented over the symbolic graph. Therefore, before injecting a symbolic fact from KGs into LLMs, we first transform the triple (s, r, o) into its textual string, called verbalization. While there exist recent methods (Oguz et al., 2022; Ma et al., 2022) that particularly design or even learn the graph-to-text transformation, in this work, we use linear verbalization: concatenating the subject, relation, and object texts in the triple, which we observe works well in LM prompting (See Appendix B.5). For instance, the triple (Lady Susan, written by, Jane Austen) is used as-is: "(Lady Susan, written by, Jane Austen)", as an LLM's input.
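Linear verbalization is a one-line transformation; a sketch, where the function name `verbalize` is our own:

```python
def verbalize(triple):
    """Linear verbalization: keep the (subject, relation, object) text as-is."""
    s, r, o = triple
    return f"({s}, {r}, {o})"

text = verbalize(("Lady Susan", "written by", "Jane Austen"))
# text == "(Lady Susan, written by, Jane Austen)"
```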
Knowledge Injection Based on the verbalized facts associated with the input question, the remaining step is to realize the knowledge injection mechanism, which allows LLMs to be grounded on the external knowledge useful for generating the answer. Let us assume we have a set of N associated triples k = {(s_i, r_i, o_i)}_{i=1}^{N} for question x. Then, similar to the instruction template T : x → x′ described in Section 3.1, we modify the N verbalized triples k along with the instruction for knowledge injection into the knowledge prompt k′, as follows: T : k → k′. One particular template we use for constructing the prompt is as follows: we first enumerate the N verbalized triples line by line and then add the specific instruction, "Below are facts in the form of the triple meaningful to answer the question.", at the top of the prompt. After that, the knowledge prompt string k′ is prepended to the question prompt x′, and LLMs conditioned on the knowledge and question prompts then sequentially generate the answer tokens, formalized as follows: P(y|[k′, x′]), where [•] denotes concatenation.
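Putting the template T : k → k′ and the concatenation [k′, x′] together, a minimal sketch (the function names are our own; the instruction string is the one quoted above):

```python
INSTRUCTION = ("Below are facts in the form of the triple "
               "meaningful to answer the question.")

def knowledge_prompt(triples):
    """T: k -> k' -- the instruction, then one verbalized triple per line."""
    lines = [INSTRUCTION] + [f"({s}, {r}, {o})" for (s, r, o) in triples]
    return "\n".join(lines)

def full_prompt(triples, question_prompt):
    """Concatenation [k', x']: knowledge prompt prepended to the question prompt."""
    return knowledge_prompt(triples) + "\n" + question_prompt

p = full_prompt([("Lady Susan", "written by", "Jane Austen")],
                "Please answer the question: Who is the author of Lady Susan?")
```

The resulting string `p` is what the LLM is conditioned on in P(y|[k′, x′]).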

Question-Relevant Knowledge Retrieval
The proposed KAPING framework in Section 3.2 allows LLMs to leverage the knowledge from KGs for zero-shot QA. However, there is a critical challenge: the number of triples associated with questions is often too large to forward to LLMs. Also, most of them are unrelated to the question, misleading LLMs into generating irrelevant answers.
Knowledge Retriever To overcome those limitations, we further propose to retrieve and augment only the triples relevant to the question. Note that there exists a document-retrieval scheme (Lin et al., 2021), whose goal is to retrieve relevant documents for a given query based on their embedding similarities, which motivates us to retrieve, in our case, the triples for the user's question. In particular, thanks to the verbalizer defined in Section 3.2, we can work with triples, obtained from the symbolic KG, in the text space. Therefore, for each verbalized triple and the question, we first embed them in the representation space with off-the-shelf sentence embedding models for text retrieval (Song et al., 2020; Karpukhin et al., 2020; Xiong et al., 2021), and then calculate their similarities. After that, we use only the top-K most similar triples, instead of all N triples, associated with the given question. Note that, unlike a few recent studies (Oguz et al., 2022; Ma et al., 2022; Kang et al., 2022) that aim at improving KG retrievers themselves under supervised training, we focus on zero-shot LM prompting with KGs; thus, we use off-the-shelf retrievers as a tool to filter out triples unnecessary for the question.
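The retrieval step reduces to: embed the question and every verbalized triple, score by similarity, and keep the top K. A sketch of that logic, where the bag-of-words `embed` is only a toy stand-in for an off-the-shelf sentence encoder such as MPNet, and all function names are our own:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a real system would use an off-the-shelf
    # sentence encoder here. Only the top-K selection mirrors the framework.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_k(question, verbalized_triples, k=10):
    """Keep only the K triples whose embeddings are closest to the question's."""
    q = embed(question)
    ranked = sorted(verbalized_triples,
                    key=lambda t: cosine(q, embed(t)), reverse=True)
    return ranked[:k]

top = retrieve_top_k(
    "Who is the author of Lady Susan?",
    ["(Poseidon, genre, disaster film)",
     "(Lady Susan, written by, Jane Austen)"],
    k=1)
# top holds the "Lady Susan" triple, which shares tokens with the question
```

Swapping `embed` for a dense sentence encoder leaves `retrieve_top_k` unchanged, which is what makes the pipeline training-free.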

Experimental Setups
We explain datasets, models, metrics, and implementations. For additional details, see Appendix A.

Datasets
We evaluate our Knowledge-Augmented language model PromptING (KAPING) framework on two Knowledge Graph Question Answering (KGQA) datasets, namely WebQuestionsSP and Mintaka.
Mintaka This dataset (Sen et al., 2022) was recently designed with the Wikidata KG for complex KGQA tasks. Among 8 different languages, we use the English test set consisting of 4,000 samples.

Baselines and Our Model
In this subsection, we explain four zero-shot LM prompting baselines and our KAPING framework.
No Knowledge This is a naive LM prompting baseline, which generates answers from input questions without knowledge augmentation from KGs.
Random Knowledge This is an LM prompting baseline, which additionally augments K randomly sampled triples, associated with the entities appearing in the question, to the prompt.
Popular Knowledge This is an LM prompting baseline, which augments the K popular triples among all triples from the question entities, based on relations that appear most frequently in the KG.
Generated Knowledge This is an LM prompting baseline, which first extracts knowledge from the LLMs themselves via prompting, and then augments it in the form of a prompt (Liu et al., 2022b), which is similar to Kojima et al. (2022).
KAPING (Ours) This is our Knowledge-Augmented language model PromptING (KAPING) framework, which first retrieves the top-K triples most similar to the question with the knowledge retriever, and then augments them in the form of a prompt.
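The two knowledge-selection baselines can be sketched as follows. The function names are our own, and note one simplification: for "Popular Knowledge" we count relation frequencies over the candidate triples only, whereas the baseline above counts them over the whole KG:

```python
import random
from collections import Counter

def random_knowledge(triples, k, seed=0):
    # "Random Knowledge" baseline: K triples sampled uniformly
    # (seed fixed here only for reproducibility).
    return random.Random(seed).sample(triples, min(k, len(triples)))

def popular_knowledge(triples, k):
    # "Popular Knowledge" baseline: prefer triples whose relation occurs
    # most often (frequencies taken over the candidates, for illustration).
    freq = Counter(r for _, r, _ in triples)
    return sorted(triples, key=lambda t: freq[t[1]], reverse=True)[:k]

triples = [("A", "genre", "x"), ("A", "genre", "y"), ("A", "author", "z")]
top = popular_knowledge(triples, 2)
# top holds the two "genre" triples, since that relation is most frequent
```

Neither baseline looks at the question text, which is why both underperform similarity-based retrieval in Table 2.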

Evaluation Metrics
Generation Following the evaluation protocol of generative KGQA (Yin et al., 2016; Sen et al., 2022; Mavi et al., 2022), we use accuracy, which measures whether the tokens generated from the given prompt include one of the answer entities. Note that we further consider aliases — a set of alternative names — of answer entities available in the Freebase and Wikidata KGs, for evaluation.
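This accuracy reduces to a containment check over entity names and their aliases; a sketch, where the function name and the example strings are our own:

```python
def answer_accuracy(generated, gold_aliases):
    """1 if the generated text contains any surface form (entity name or
    alias) of any answer entity, else 0 -- the generative KGQA accuracy."""
    text = generated.lower()
    return int(any(alias.lower() in text
                   for entity in gold_aliases for alias in entity))

acc = answer_accuracy(
    "Alex Chilton died in New Orleans, Louisiana.",
    [["New Orleans", "NOLA"]])
# acc == 1: the sentence-level output contains the answer entity
```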
Retrieval We also measure the retriever performance, to see how helpful the retrieved triples are for answer generation. As metrics, we use Mean Reciprocal Rank (MRR) and Top-K accuracy (Top-K), which are calculated from the ranks of correctly retrieved triples (i.e., those containing answer entities) among all triples associated with the question entities.
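Both retrieval metrics operate on the rank of the first correct triple per question; a sketch with our own function names, using None for questions where no retrieved triple contains an answer entity:

```python
def mrr(first_correct_ranks):
    """Mean Reciprocal Rank; None contributes a reciprocal rank of 0."""
    return sum(1.0 / r for r in first_correct_ranks if r is not None) \
        / len(first_correct_ranks)

def top_k_accuracy(first_correct_ranks, k):
    """Fraction of questions whose first correct triple ranks within top K."""
    hits = sum(1 for r in first_correct_ranks if r is not None and r <= k)
    return hits / len(first_correct_ranks)

ranks = [1, 3, None, 2]   # toy ranks over four questions
# mrr(ranks) == (1 + 1/3 + 0 + 1/2) / 4 == 11/24
# top_k_accuracy(ranks, 2) == 0.5
```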

Implementation Details
For the knowledge injection, we set the number of retrieved facts to 10 (K = 10), and the number of hops for triple retrieval to one. For the text-based retriever, we experiment with MPNet (Song et al., 2020), which uses the same encoder for embedding the question and triples. See Appendix A.4 for additional details.

Experimental Results and Analyses
We provide the overall results of our KAPING framework along with its comprehensive analyses.

Main Results
As shown in Table 1, our KAPING framework significantly outperforms all LM prompting baselines on zero-shot KGQA tasks. In particular, the generated knowledge model mostly degrades performance compared to the no-knowledge model, since the knowledge extracted from LLMs themselves might be inaccurate. On the other hand, the random and popular knowledge baselines bring performance improvements, since the augmented knowledge from KGs is sometimes useful for answering the question. However, ours outperforms them, which suggests that, for zero-shot LM prompting for QA, the knowledge internalized in LLMs is insufficient to generate factual answers, and it is important to use only the relevant facts.
In addition, we also observe larger performance improvements when LMs are relatively small. In other words, since smaller models have insufficient parameter space to memorize the knowledge during pre-training, they are more likely to generate factually incorrect answers. However, when the appropriate knowledge is given to them, their performance sometimes becomes similar to that of larger models (e.g., different sizes of OPT achieve similar performance with our KAPING). Therefore, for tasks that require factual knowledge under low-resource setups (e.g., production), augmenting the knowledge would be beneficial, instead of increasing model sizes to internalize the huge volume of knowledge.

Retriever Results
To see how relevant the augmented knowledge is, we further measure the retrieval performance. As shown in Table 2, the existing retrieval model (i.e., MPNet) shows superior performance against the naive models: the random and popular retrievers. This result suggests that our simple graph-to-text verbalization works well with the existing retriever, which further confirms that our KAPING augments useful facts in the LM prompt. Regarding the number of hops for the candidate triples to retrieve, we observe that, when we increase the hop size from one to two, the retriever is more likely to retrieve irrelevant triples that do not include answer entities, as shown in Table 2. Therefore, in our experiments, we retrieve knowledge among the 1-hop triples of question entities.
Additionally, since we can alternatively answer the input question based on entities in the Top-1 triple from the retriever, we compare the generation performance of LLMs to the retrieval performance. As shown in Figure 2, LM prompting schemes even without knowledge augmentation (i.e., no knowledge) are superior to simply answering with the entity in the retrieved triple, except on the WebQSP w/ Freebase dataset. Also, we observe huge gaps between our KAPING framework and the simple retrieval scheme on all datasets. These results suggest that, for zero-shot KGQA, it is helpful to leverage LLMs to generate answers based on their internalized and external facts, instead of directly searching for answer entities over KGs.

Impact of Correct & Incorrect Retrievals
We conduct analyses on how much the correctly retrieved triples, which contain answer entities, bring performance improvements, and how performance is affected by the incorrectly retrieved triples, which do not.

Varying the Amount of Knowledge We change the number of augmented facts, to see which amount of triples is optimal for the prompt, by comparing the trade-off between the generation performance and the wall-clock time. First of all, as shown in Figure 5, most LLMs reach their highest performance when the number of triples is 5 or 10. Also, when we further increase the number of augmented triples to 15 and 30, the performance of OPT models decreases substantially. This result suggests that some LMs might be distracted by irrelevant triples when their volume is high, therefore failing to select and generate the answer entity.
We then measure the wall-clock time of answer generation for the encoder-decoder (T0) and decoder-only (OPT) models, varying the number of augmented triples in the prompt. As shown in Table 3, for the encoder-decoder model, our KAPING framework with fewer than 10 triples is faster than the model without knowledge augmentation. We observe this is because, when knowledge is augmented, the model tends to generate shorter answers, which reduces the decoding time. More specifically, the average length of generated tokens for the T0 model with 10 triples is 15, whereas the no-knowledge model generates 32 tokens on average. However, for the decoder-only model (OPT), the more knowledge we augment, the slower the model becomes, because of its auto-regressive processing of the input.

Table 4: Generation examples of the prompted GPT-3 for the input question with augmented triples from the retriever, where, in the last row, we change the knowledge of the augmented facts to see whether the model is able to adapt to the changed knowledge. Question: Where did Alex Chilton die? Retrieved triples: (Alex Chilton, place of death, New Orleans), (Alex Chilton, manner of death, natural causes), (Alex Chilton, cause of death, myocardial infarction), (Alex Chilton, date of death, time: +2010-03-17), ...

Impact of Orders of Retrieved Triples
In few-shot LM prompting, where LLMs additionally observe a few examples in the prompt, they are known to be sensitive to the order of examples (Lu et al., 2022), and they tend to follow the answer in the last example (Zhao et al., 2021). Based on these observations, we also analyze whether the order of retrieved triples affects performance.
In particular, we vary the location of the triples more similar to the question, placing them at the Top, Bottom, or Random position of the prompt. As shown in Figure 4

Case Study
We conduct a case study in Table 4.
In particular, when the knowledge is not given to the LM, it hallucinates a factually incorrect answer. However, when related facts are retrieved and augmented in the prompt, it generates the correct answer. In addition, we analyze whether our KAPING can adapt to updated knowledge, motivated by the fact that some knowledge changes over time, while the knowledge in LMs remains static. To do so, as shown in the last row of Table 4, we replace the object entities of triples, and then forward the prompt with the modified facts to the LM. The result shows that the LM generates the output based on the updated facts, which suggests the potential for adapting LMs without costly updates of their parameters.

Conclusion
In this work, we focused on the limitation of existing LM prompting schemes, which rely on the static knowledge internalized in parameters; therefore, when such knowledge is incomplete, inaccurate, or outdated, LLMs may generate factually incorrect answers. To tackle this challenge, we introduced a novel Knowledge-Augmented language model PromptING (KAPING) framework, which augments the knowledge relevant to the input question from KGs directly in the input prompt of LLMs, with a fact retriever to inject only the relevant knowledge. The proposed framework is completely zero-shot and versatile with any LM, without additional parameter updates or training datasets.
We validated that our KAPING yields huge performance gains over the LM prompting model relying on its internal knowledge, especially with smaller LMs, on KGQA tasks. We believe our new mechanism for augmenting facts from KGs in the LM prompt will bring substantial practical impact in generating knowledge-grounded answers.

Limitations
In this section, we faithfully discuss the current limitations and potential avenues for future research. First of all, the generation performance of our knowledge-augmentation framework largely depends on the efficacy of the retriever. In other words, if the retriever fails to retrieve the facts relevant to the input question, the prompted LLM, conditioned on irrelevant facts, is likely to generate an incorrect answer (See Figure 3). Similarly, if the retriever is not designed to retrieve facts in the 2-hop neighborhoods of the question entities, LLMs are less likely to generate answers requiring 2-hop knowledge. Note that, for the Mintaka dataset (Sen et al., 2022), the number of questions answerable with 1-hop facts is only 40% of the total samples. However, when we include 2-hop triples, the number of answerable questions increases to 62%, which suggests the necessity of 2-hop retrieval, which is still challenging (See Table 2). Thus, future work may improve the retrieval scheme itself to provide more accurate facts, including multi-hop ones, to the LLM, or may develop a mechanism to prevent the LLM from being misled by unrelated facts.
On the other hand, the evaluation metric for the generation performance of prompted LLMs may be further improved. Specifically, regarding our target KGQA tasks, the answer to the question is an entity in the KG. However, prompted LLMs without additional training (i.e., zero-shot) tend to generate the answer as a sentence. For instance, the label entity for the question (e.g., Where did Alex Chilton die?) in Table 4 is "New Orleans"; however, the LLMs often generate a sentence-level output: "Alex Chilton died on March 17, 2010 in New Orleans, Louisiana due to a myocardial infarction". We currently evaluate model performance by measuring whether the generated tokens contain the answer entity or not; however, it would be worthwhile to develop an additional metric that compares the sentence-level output from LLMs to the word-level answer in KGs in a more effective way. Note that we also tried other available metrics (See Appendix B.3), such as F1 and Exact Match (EM) scores (Rajpurkar et al., 2016); however, they heavily penalize longer sentences (e.g., the EM of the correct examples in Table 4 is 0), and thus may not be appropriate for evaluating LM prompting schemes.
Lastly, since we focus on improving knowledge injection in LM prompting, we use the labeled entities in the KGQA datasets when evaluating models, following the existing KGQA evaluation setups (Cohen et al., 2020; Sen et al., 2021). However, in real-world applications where the entities in the question are mostly not provided, we first need to extract entities in the question with existing entity linking techniques; therefore, our model's performance depends on the efficacy of entity linking. In particular, regarding the result with entity linking in Table 5, the portion of answerable questions from the labeled entities in the dataset is 40%, whereas the portion with entities from the entity linking model (Ayoola et al., 2022) is 22%. Therefore, since improved entity linking performance would contribute to the performance gain of our KAPING framework on KGQA tasks, future work may advance the entity linking scheme.

Ethics Statement
For a user's question, our knowledge-augmentation scheme allows prompted LMs to generate a factually correct answer, grounded in the provided knowledge, for KGQA tasks. However, the performance of our KAPING framework is still far from perfect, due to potential failures in entity linking, fact retrieval, and knowledge generation itself. Thus, we should check whether LMs generate correct answers, especially in high-risk domains.

A Additional Experimental Setups
Here we provide additional experimental setups.

A.1 Datasets
We provide additional details for the two Knowledge Graph Question Answering (KGQA) datasets, namely WebQuestionsSP and Mintaka, which we use for evaluating baselines and our model.
Mintaka This dataset (Sen et al., 2022) is designed for complex KGQA tasks, including superlative and comparative questions, where question-answer pairs are collected via crowdsourcing with Wikidata entities (Vrandecic and Krötzsch, 2014).

A.2 Large Language Models
We describe the specific details of Large Language Models (LLMs) that we use for LM prompting.
T5 This model (Raffel et al., 2020) is an encoder-decoder model; among its variants, we use the LM-adapted version, which is additionally pre-trained with an auto-regressive language modeling objective (Radford et al., 2018) for LM prompting.
T0 This model (Sanh et al., 2022) is further fine-tuned from T5 (Raffel et al., 2020) over prompted text-to-text tasks, for improved zero-shot generalization performance with LM prompting.
GPT-3 This model (Brown et al., 2020) is a decoder-only model, which we access via its API.
OPT This model (Zhang et al., 2022) is a decoder-only model, freely available to researchers.
AlexaTM This model (Soltan et al., 2022) is an encoder-decoder model, pre-trained with both a denoising objective, which reconstructs 15% of dropped tokens from their context, and an auto-regressive objective, which predicts the next token from the previous tokens.

A.3 Evaluation Metrics
We provide more details for evaluation metrics.
Aliases For generative question answering tasks, there can be alternative names of entities, called aliases, and we consider them during evaluation. For example, the Wikidata entity "William Shakespeare" (Q692) has alternative names, such as "Shakespeare" and "The Bard", which we consider when measuring generation performance.
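The alias-aware containment check can be sketched as follows. This is a simplified stand-in: the normalization and the alias list are illustrative, not the exact Wikidata alias sets or matching rules used in our evaluation.

```python
import re


def normalize(text: str) -> str:
    # Lowercase and strip punctuation so surface variants still match.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def contains_answer(generated: str, answer: str, aliases: list[str]) -> bool:
    """Return True if the generation mentions the answer entity or any alias."""
    gen = f" {normalize(generated)} "
    return any(f" {normalize(name)} " in gen for name in [answer, *aliases])


# Hypothetical example with Wikidata-style aliases for Q692.
print(contains_answer(
    "The play was written by the Bard of Avon.",
    "William Shakespeare",
    ["Shakespeare", "The Bard"],
))  # True: the alias "The Bard" appears in the generated sentence
```

Matching on whitespace-padded normalized strings avoids spurious partial-word hits (e.g., "Paris" inside "Parisian").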
Filtering Unnamed Entities For evaluating generative models, the names of entities are required. However, we sometimes cannot find the name of an answer entity from its id in the Freebase and Wikidata KGs. This is because the annotated answer entities are sometimes categories rather than entities, and entity ids in KGs can change while the KG dumps used to annotate the datasets are unavailable. Therefore, we filter out samples that do not have literal name texts for the answer entities. This filtering step results in 1,582 test samples for WebQSP w/ Freebase, 1,466 test samples for WebQSP w/ Wikidata, and 2,814 test samples for Mintaka.

A.4 Implementation Details
In this subsection, we provide additional details for implementing our KAPING framework.
Knowledge Injection Schemes There are several choices in knowledge injection schemes: the number of facts to retrieve, the number of hops for candidate triples, the order of retrieved facts in the prompt (i.e., where the most relevant knowledge should be located), and the prompt template including its instruction text. While their search spaces are extremely large, we aim to find a good configuration (see analyses in Section 5). Specifically, as reported in Section 4.5, the best settings we find are 10 retrieved facts and one hop for the triples retrieved from the question entities. Also, we place triples more relevant to the input question closer to the question text in the prompt, inspired by the observation that the model tends to repeat answers that appeared at the end of the prompt (Zhao et al., 2021). Further, we examine different instruction templates for generating answers, such as "Question: {x} Answer: " or "Please answer the following question: {x}", where x is the literal question. We observe that the performance of LLMs is sensitive to the instruction (see Appendix B.2); therefore, we try both templates and report the best result.
Retrieval Models To augment only the triples relevant to the input question under the zero-shot setup, we use off-the-shelf text-based retriever models. Specifically, we experiment with two different types of retrievers: a symmetric retriever that uses the same encoder for questions and triples, and an asymmetric one that uses individual encoders for each. For the symmetric retriever, we use MPNet (Song et al., 2020), which is trained on 1B sentence pairs. For the asymmetric retriever, we use TAS-B (Hofstätter et al., 2021), which is trained on the MS-MARCO dataset (Nguyen et al., 2016). We mainly report results with MPNet, unless noted otherwise, since their performances are similar (see Appendix B.1).
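The retrieve-then-order pipeline above can be sketched as follows. The bag-of-words embedding is a toy stand-in for a sentence encoder such as MPNet (the real retriever scores dense embeddings), but the verbalization, top-k selection, and "most relevant triple closest to the question" ordering mirror the scheme described above.

```python
import re
from collections import Counter
from math import sqrt


def verbalize(triple):
    # Linear verbalization: concatenate subject, relation, object (Section 3.2).
    s, r, o = triple
    return f"({s}, {r}, {o})"


def embed(text):
    # Toy stand-in for a sentence encoder: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_prompt(question, triples, k=10):
    """Retrieve the top-k triples by similarity to the question and build the
    prompt, placing the most relevant triple closest to the question text."""
    q = embed(question)
    top_k = sorted(triples, key=lambda t: cosine(q, embed(verbalize(t))),
                   reverse=True)[:k]
    facts = [verbalize(t) for t in reversed(top_k)]  # most relevant last
    instruction = ("Below are facts in the form of the triple meaningful "
                   "to answer the question.")
    return "\n".join([instruction, *facts, f"Question: {question} Answer: "])


triples = [
    ("Alex Chilton", "place of death", "New Orleans"),
    ("Alex Chilton", "occupation", "songwriter"),
    ("New Orleans", "country", "United States"),
]
print(build_prompt("Where did Alex Chilton die?", triples, k=2))
```

With a real encoder one would pre-embed all candidate triples once and reuse them across questions; the ordering logic stays the same.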

A.5 Hyperparameters and Resources
We evaluate all models with the PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2020) libraries. For encoder-decoder models, we set the maximum number of input tokens to 1,024 and the maximum number of output tokens to 128. For decoder-only models, we set the maximum token length to 1,152 (1,024 + 128).
For computing resources, we run all models on 8 V100 GPUs with 8 × 32GB of GPU memory, on which every model is runnable within one day. Note that, due to the expensive computational cost of prompting LLMs, we run every model once and report the results, without additional hyperparameter tuning unless noted.

B Additional Experiment Results
In this section, we provide additional experimental results: comparisons of available text-based retrieval models in Section B.1, sensitivity analyses on the template texts of the prompt in Section B.2, and extra evaluation metrics in Section B.3.

B.1 Performance Comparisons of Retrievers
As shown in Table 6, we observe similar performances between the symmetric (MPNet) and asymmetric (TAS-B) retrievers, which suggests that our simple graph-to-text verbalization is robust across different text-based retrieval schemes. Since the retrieval performances of both are similar, we conduct experiments mainly with MPNet, to reduce the expensive computational costs of GPU usage.

B.2 Sensitivity Analyses on Template Texts
Following the observation of Zhao et al. (2021), the performances of LLMs vary across different prompt templates. Since it is computationally infeasible to try all prompt templates on various LLMs, we consider the two question templates described in Appendix A.4. In particular, for the question x, we use either "Question: {x} Answer: ", which we refer to as the default template, or "Please answer the following question: {x}", referred to as the please template. As shown in Table 7, for the T5 model, the default template is superior to the please template, whereas for the OPT model, the please template is superior. For the T0 and GPT-3 models, the performance differences between the default and please templates are marginal. These results suggest that instruction templates may need to be selected carefully for each LLM to achieve optimal performance.
Additionally, regarding the knowledge-injection template described in Section 3.2, we observe that the generation performance of GPT-3 also depends on the instruction text in the template. In particular, we mainly conduct experiments with the template: "Below are facts in the form of the triple meaningful to answer the question."; however, we observe performance degradation when the augmented triples are irrelevant to the given question, as shown in Figure 3. Therefore, to improve performance on incorrect retrievals, we further experiment with an additional template: "Below are facts in the form of the triple that might be meaningful to answer the question.". The GPT-3 (175B) model with the original template achieves 74.16 and 42.80 accuracy for correct and incorrect retrievals, respectively, while the same model with the instruction template containing "might be" achieves 72.91 and 51.38. These results suggest that the "might be" template makes the model less reliant on the augmented triples and more on the knowledge internalized in its parameters, improving performance on incorrect retrievals while slightly degrading it on correct ones.
Table 8: LM prompting results with additional metrics: F1 and Exact Match (EM), along with accuracy (Acc.) scores.
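The template variants discussed in this subsection can be assembled as follows; this is a minimal sketch that reproduces the two question templates and the hedged "might be" knowledge-injection instruction as plain strings.

```python
def make_prompt(question, facts=None, hedged=False, template="default"):
    """Assemble a zero-shot prompt with optional knowledge injection.

    template: "default" -> 'Question: {x} Answer: '
              "please"  -> 'Please answer the following question: {x}'
    hedged:   use the 'might be meaningful' instruction variant.
    """
    parts = []
    if facts:
        hedge = "that might be meaningful" if hedged else "meaningful"
        parts.append("Below are facts in the form of the triple "
                     f"{hedge} to answer the question.")
        parts.extend(facts)
    if template == "default":
        parts.append(f"Question: {question} Answer: ")
    else:
        parts.append(f"Please answer the following question: {question}")
    return "\n".join(parts)


print(make_prompt("Where did Alex Chilton die?",
                  facts=["(Alex Chilton, place of death, New Orleans)"],
                  hedged=True))
```

Sweeping `template` and `hedged` over the same question set is how one would reproduce the sensitivity comparison in Table 7.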

B.3 Additional Evaluation Metrics
As described in Section 4.4, we evaluate the performance of LLMs based on whether the generated tokens for the input question contain the answer entities. This is because, as explained in the limitations in Section 6, pre-trained LLMs without further fine-tuning tend to generate the answer as a sentence, while the answer for the KGQA task is an entity consisting of a few tokens. In this subsection, we further provide experimental results with additional evaluation metrics (Rajpurkar et al., 2016), namely F1 and Exact Match (EM) scores. Note that these are frequently used for evaluating extractive QA models, whose goal is to classify the answer span in the given context, without generation. As shown in Table 8, since the F1 score strongly penalizes longer sentences, the performances of LLMs evaluated by F1 decrease considerably, except for the T0 model, which is further fine-tuned on prompted text-to-text tasks, including QA, and is thus capable of generating entity-level outputs. Similarly, except for T0, it is highly suboptimal to evaluate prompted LMs with EM scores, due to differences in output lengths. Thus, it would be a promising direction to develop better evaluation metrics for KGQA under LM prompting schemes, which we leave as future work.
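The length-penalty effect is easy to see by implementing the three metrics side by side. The sketch below follows the standard SQuAD-style normalization (lowercase, strip punctuation and articles) alongside our containment accuracy; it is an illustration, not the exact evaluation script.

```python
import re
import string
from collections import Counter


def normalize(s):
    # SQuAD-style normalization: lowercase, drop punctuation and articles.
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)


def contains(pred, gold):
    # The accuracy we report: does the generation contain the answer entity?
    return float(normalize(gold) in normalize(pred))


pred = "Alex Chilton died on March 17, 2010 in New Orleans, Louisiana."
gold = "New Orleans"
print(exact_match(pred, gold), round(f1(pred, gold), 3), contains(pred, gold))
# 0.0 0.308 1.0
```

The factually correct sentence scores 0 EM and roughly 0.31 F1 purely because it is longer than the gold entity, while containment accuracy credits it fully.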
While F1 and EM scores, used for extractive QA tasks, might be suboptimal for evaluating generative LM prompting schemes, our KAPING framework consistently outperforms all other baselines on averaged F1 and EM scores as well, by large margins. The superior EM and F1 scores of the generated knowledge baseline with GPT-3 in a few cases, even though they rarely happen, arise because, for this baseline, the GPT-3 model generates entity-level outputs, unlike ours, which generates sentence-level outputs. In other words, the sentence-level outputs from our KAPING are often longer than the answer entities, since our model is grounded by the retrieved facts from KGs, as shown in Table 15; however, longer sentences are penalized by F1 and EM scores. More specifically, the average output sequence length of the generated knowledge model is 67.77, while ours is 74.92. However, when we compare the generated knowledge baseline to our KAPING with other LLMs and also with other metrics, our KAPING significantly outperforms this baseline.
Human Evaluation Additionally, similar to previous generative QA work (Roberts et al., 2020), we manually inspect 30 samples from the WebQSP w/ Freebase dataset to see whether the generated sentence is factually correct for the input question.
For this experiment, we evaluate four LLMs: T0 (3B), T0 (11B), GPT-3 (6.7B), and GPT-3 (175B), with the no knowledge baseline and our KAPING. We use three ratings for each generated example: 1) correct, if all information in the generated sentence is factually correct for the question; 2) semi-correct, if some information in the generated sentence is factually incorrect yet it contains at least one answer entity; 3) incorrect, for all other cases. As shown in Table 9, our KAPING framework generates factually correct answers more often than the no knowledge baseline, which is consistent with the results from the available evaluation metrics in Table 1 and Table 8. We provide the generated answers used for the human evaluation in Table 9, for the GPT-3 (175B) and T0 (3B) models, in Table 15 and Table 16.

B.4 Performances of Few-Shot Learning
While the focus of our work is zero-shot, as outlined in the main paper, in this subsection we additionally extend the zero-shot setting to the few-shot setting, where we prepend a few examples of input-output pairs to the prompt of LLMs. As shown in Table 10, for the KGQA task, the performances decrease as we increase the number of samples (i.e., shots) in the input prompt, except for the OPT model. We suggest this might be because the injected examples in the prompt are less relevant to the given factual question, misleading the model to focus on the unrelated contexts of the injected examples. This phenomenon is even more severe in our KAPING framework, similarly because KAPING augments the retrieved facts, and, if the facts of the other few-shot examples are also injected into the input prompt, the model is more likely to be confused by those irrelevant facts.
For the OPT model, we observe a slight performance improvement for the No Knowledge model, since the few injected examples provide a hint about what the output format looks like. We leave extending our zero-shot KAPING framework to the few-shot learning mechanism as future work.
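The few-shot extension of Table 10 amounts to prepending exemplar question-answer pairs (optionally each with its own retrieved facts) before the test question. A minimal sketch, with a hypothetical exemplar:

```python
def few_shot_prompt(question, exemplars, facts_per_exemplar=None):
    """Prepend k (question, answer) exemplars, each optionally preceded by
    its own retrieved facts, before the test question."""
    parts = []
    for i, (q, a) in enumerate(exemplars):
        if facts_per_exemplar:
            parts.extend(facts_per_exemplar[i])
        parts.append(f"Question: {q} Answer: {a}")
    parts.append(f"Question: {question} Answer: ")
    return "\n".join(parts)


prompt = few_shot_prompt(
    "Where did Alex Chilton die?",
    exemplars=[("Who wrote Hamlet?", "William Shakespeare")],
)
print(prompt)
```

This makes the failure mode above concrete: with KAPING, `facts_per_exemplar` injects triples that are irrelevant to the test question, which is the likely source of the performance drop as the shot count grows.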

B.5 Analyses on Knowledge Verbalization
As described in the Knowledge Verbalization paragraph of Section 3.2, we use the linear triple verbalization technique, which simply concatenates the tokens of the subject, relation, and object in the triple, instead of sophisticated techniques that use particular graph-to-text transformation methods (Oguz et al., 2022; Ma et al., 2022). This is because we observe that our simple verbalization technique works well; in this subsection, we concretely show the performance differences between our verbalization technique and existing ones in both the knowledge retrieval and injection steps. Note that, for the comparison, we use the trained knowledge verbalizer proposed in Ma et al. (2022).
Table 13: Efficiency results, where we measure the wall-clock time of every model for generating answers on the WebQSP w/ Wikidata dataset. The document augmentation model (Lazaridou et al., 2022) augments the documents listed in their paper, while ours augments triples relevant to the question retrieved from KGs. We set the maximum number of input tokens for the T5 and T0 models to 1,024, and for OPT to 2,048. OOL denotes out-of-length errors, where the input prompt length exceeds the maximum input token length. OOM denotes an out-of-memory error on a machine with eight V100 GPUs.
We first provide the fact retrieval performances across the different knowledge verbalization methods in Table 11. As shown in Table 11, our simple triple-form text verbalization is superior to the free-form text verbalization for fact retrieval. This might be because the free-form verbalization model, which transforms the graph into text, can generate incorrect output that is semantically different from the original triple, degrading the retrieval performance.
On the other hand, we also report the KGQA generation results of our KAPING framework with the two different knowledge verbalizers in Table 12. As shown in Table 12, the performances of the free-form texts and the triple-form texts are comparable when augmented to LLMs with our KAPING framework. More specifically, for the T5 model, which is pre-trained on an unlabeled corpus without additional instruction tuning, the free-form text works well. Meanwhile, for the T0 model, which is further fine-tuned on natural language instruction tasks, it is beneficial to use our linear triple verbalization scheme.

B.6 Additional Efficiency Comparisons
In this subsection, we further provide efficiency results for all LLMs used in our main experiments across three different models: the no knowledge model, the document augmentation (i.e., web augmentation) model (Lazaridou et al., 2022), and our KAPING framework. As discussed in the Knowledge-Augmented LMs paragraph of Section 2, the web augmentation method augments documents searched from Google under the few-shot learning setup. However, as we discuss there, this web augmentation is orthogonal to ours, since we use a completely different knowledge source (i.e., KGs) and our work is under the zero-shot learning setup; our core mechanisms for retrieving and augmenting relevant knowledge with LM prompting are thus clearly different and novel. Furthermore, as discussed in Section 2, this web augmentation method is infeasible to compare experimentally, since individual researchers cannot freely access the Google Search API to retrieve documents for every question in the world. It is also computationally expensive to augment documents consisting of hundreds to thousands of tokens (Lazaridou et al., 2022) into LLMs, unlike our triples consisting of a few tokens. To experimentally validate the latter issue, we further compare the computational costs of document augmentation and our fact augmentation. In particular, as shown in Table 13, the answer generation speed of the web augmentation mechanism is significantly slower than that of our triple augmentation mechanism, since it requires more time to encode and condition on documents in the input prompt compared to triples. Also, following the original paper (Lazaridou et al., 2022), the suggested number of documents to augment is 15; however, in most cases, we observe out-of-length (OOL) errors, since the length of the input prompt with 15 documents is longer than the maximum input sequence length of LLMs. While our fact augmentation scheme is slower than the model without augmentation, we believe that, given the substantially improved performance in Table 1 and the high efficiency compared to document augmentation in Table 13, KAPING is highly beneficial.
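The out-of-length (OOL) condition reported in Table 13 can be illustrated with a simple length check. Whitespace splitting here is a crude stand-in for the model's tokenizer, and the per-document/per-triple token counts are illustrative round numbers, not measurements from the paper.

```python
def count_tokens(text):
    # Whitespace split as a crude stand-in for the model tokenizer.
    return len(text.split())


def check_prompt(prompt, max_input_tokens=1024):
    """Return (n_tokens, fits); a prompt exceeding the model's maximum
    input length corresponds to the out-of-length (OOL) case."""
    n = count_tokens(prompt)
    return n, n <= max_input_tokens


# Illustrative sizes: ~15 documents of a few hundred tokens each,
# versus ~10 triples of a handful of tokens each.
doc_prompt = " ".join(["token"] * 15 * 300)
triple_prompt = " ".join(["token"] * 10 * 8)
print(check_prompt(doc_prompt))     # far over 1,024 tokens: OOL
print(check_prompt(triple_prompt))  # well under the limit: fits
```

This is why 15-document augmentation routinely exceeds the 1,024-token input budget while 10 verbalized triples fit with room to spare.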

B.7 Result Analyses Across Question Types
For the Mintaka dataset (Sen et al., 2022), each question belongs to one of the following categories, which define the complexity of questions: Generic, Multihop, Intersection, Difference, Comparative, Superlative, Ordinal, Count, and Yes/No. To see for which complexity categories our knowledge-augmentation framework is helpful, and which categories we should further improve on, we break down the performance of LLMs by question type in Table 14. Note that, following the evaluation protocol in Section A.3, where we filter out questions that do not have answer names, the Yes/No questions are not considered.
As shown in the last row of Table 14, where we average the performance of all LLMs per category, our KAPING framework brings significant performance improvements in all categories except for the Comparative type. One particular comparative question is "Who has won more NBA Season MVPs, LeBron James or Steph Curry", and, since it is hard to retrieve and associate the relevant triples for such comparative questions, our KAPING underperforms the simple knowledge-injection baselines: random knowledge and popular knowledge. However, the KG-augmented models (e.g., random knowledge, popular knowledge, and our KAPING) outperform the other baselines, which suggests that the knowledge-augmentation mechanism is meaningful for tackling comparative questions, and one might further improve the retrieval scheme or the input prompt itself, which we leave as future work.
Another point we would like to mention is that, for the Count category, the performances of the T0 models are significantly low compared to other LLMs. This is surprising, since the T0 models are further fine-tuned on prompted text-to-text tasks and have strong performances in the other categories thanks to this fine-tuning. We believe the low performance in the Count category arises because the fine-tuning of T0 models includes no prompted tasks related to counting, which makes it hard for T0 models to count particular instances. Therefore, to further improve the generalization performance of T0 models, one may include more diverse prompted tasks, including counting, during the fine-tuning process.

B.8 Generation Examples
We provide comparisons of generation examples between the no knowledge baseline and our KAPING framework in Table 15 and Table 16, for the GPT-3 and T0 language models, respectively. We also provide retrieval and generation examples of our KAPING framework with four different LLMs: T5 (11B), OPT (13B), T0 (11B), and GPT-3 (175B), on the WebQSP w/ Wikidata dataset, in Table 17.

C Discussions on Prompt Design/Tuning
We discuss the differences between prompt design and prompt tuning, along with additional relevant work in the prompt tuning literature. As described in Section 3.1, given an input question, a large language model can generate the answer text, which is called LM prompting (Brown et al., 2020; Liu et al., 2021). To further enhance performance under the LM prompting scheme, prior work particularly designs the content of the prompt, which is called prompt design (Shin et al., 2020; Lu et al., 2022). More specifically, Shin et al. (2020) additionally include particular trigger tokens, meaningful to the downstream tasks, in the prompt, and Lu et al. (2022) change the order of demonstrations in the prompt under the few-shot LM prompting setup. Our method is in line with this prompt design literature, and we introduce a method of knowledge augmentation in the input prompt with facts from KGs, to allow LLMs to condition on factual knowledge for zero-shot QA.
On the other hand, there exists a prompt tuning literature (Lester et al., 2021a), which additionally trains prompt-relevant parameters with supervised learning objectives, while keeping the parameters of LLMs unchanged. While this prompt tuning approach can be beneficial in few-shot learning scenarios, where the model is additionally tuned with a few training examples, it is not suitable for our zero-shot setting. Also, unlike the prompt design approach, it is difficult to interpret and manipulate a prompt represented in the embedding space.
Note that there is some recent knowledge-aware prompt tuning work (Chen et al., 2022b; Hu et al., 2022; Chen et al., 2022a), and, while it is fundamentally different from our LM prompting (i.e., prompt design), we additionally discuss it. First, Chen et al. (2022b) tackle the relation extraction problem with prompt tuning, proposing to embed particular words related to the relation class in the embedding space. For example, for the relation type "country of birth", they embed person and country information in the representation space with training signals from supervised learning, for improved relation classification performance. Also, Hu et al. (2022) tackle the text classification task with prompt tuning, proposing to consider not only the classification label word itself, but also the label word's related words. For example, for the sentence label "science", they further consider its related words, "physics" and "mathematics", defined in particular knowledge bases, such as WordNet (Pedersen et al., 2004) and ConceptNet (Speer et al., 2017). Lastly, Chen et al. (2022a) tackle a similar text classification task with prompt tuning, proposing to retrieve data instances (i.e., a sentence and its label) from the training dataset with a retriever trained using supervised classification objectives.
However, all the above knowledge-aware prompt tuning methods are clearly different from our proposed KAPING framework. First, they are restricted to cloze-style prediction, in which they include a particular mask token in the input prompt and then classify the label (e.g., the sentiment of the sentence, or the relation in the given sentence) of the mask token, similar to the masked language modeling objective (Devlin et al., 2019; Liu et al., 2019). Such cloze-style prediction schemes cannot be used for QA tasks, since the answer to a user's question is not a single token, and it is unclear how to map the label predicted for the masked token to all the different answers in the world. In contrast, our KAPING does not rely on masked token classification; it is thus more flexible, not restricted to cloze-style classification, and suitable for answering any user's question. Furthermore, some of these methods (Chen et al., 2022a,b) require additional model training, which is not applicable in our zero-shot setting.

Figure 2 :
Figure 2: Comparisons of retrieval and LM prompting. Retrieval shows the Top-1 result of MPNet (Song et al., 2020).

Figure 3 :
Figure 3: Comparisons of correct and incorrect retrieval for the generation performance of the GPT-3 (6.7B) model.

Figure 4 :

Figure 5 :
Figure 5: Performances with varying knowledge amount, where we change the number of retrieved triples to augment.

Table 1 :
Main results of language model prompting, where we report the generation accuracy.The number inside the parentheses in the first row denotes the parameter size of language models, and best scores are emphasized in bold.

Table 2 :
Retriever results. We compare the random model, the popular model, and MPNet.

Table 3 :
Efficiency results with varying knowledge amounts, where we measure the wall-clock time of every model for generating the answer on the WebQSP w/ Wikidata dataset.
These results suggest that, when relevant knowledge is augmented, LLMs can contextualize it and generate answers accurately. Meanwhile, incorrectly retrieved knowledge makes LLMs condition on irrelevant facts and generate wrong answers.

Table 5 :
Results with entity linking, where the model w/ EL uses entities extracted from the entity linking technique (Ayoola et al., 2022), instead of the labeled ones, on Mintaka.
Our KAPING is not sensitive to the location of the retrieved triples, except for the OPT model on the WebQSP dataset. In other words, the OPT model tends to generate the entity located in the first part of the prompt input, while the other LLMs can contextualize the entire prompt input and generate the entity regardless of its position.
Effectiveness with Entity Linking Following the conventional KGQA evaluation (Cohen et al., 2020), we use the question entities labeled in the datasets to retrieve facts from KGs. However, to see the performance with entities identified by an Entity Linking (EL) technique, we further conduct experiments with the EL model, namely ReFinED (Ayoola et al., 2022). As shown in Table 5, while the performance of KAPING w/ EL decreases slightly compared to the model with labeled entities, due to the EL performance, we consistently observe meaningful performance improvements over the No Knowledge model.

Table 7 :
Results with varying instruction templates, for various LLMs on the WebQSP and Mintaka datasets.

Table 9 :
Human evaluation results, where we randomly sample 30 examples from the WebQSP w/ Freebase dataset.

Table 10 :
KGQA results with few-shot learning. We vary the number of examples (i.e., shots) in the prompt, and report the performances on the WebQSP w/ Wikidata dataset.

Table 11 :
Fact retrieval results with different verbalizers. We use the graph-to-text transformation model proposed in Ma et al. (2022) for obtaining free-form texts. For triple-form texts, we use the verbalization technique described in Section 3.2. MPNet (Song et al., 2020) is used as the retriever, and the performance is reported on WebQSP w/ Wikidata.

Table 12 :
KGQA results with different verbalizers. We use the graph-to-text transformation model proposed in Ma et al. (2022) for obtaining free-form texts. For triple-form texts, we use the verbalization technique described in Section 3.2. We then inject the verbalized triples into the input prompt, and report the generation accuracy on WebQSP w/ Wikidata.
