KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models

While large language models (LLMs) have made considerable advancements in understanding and generating unstructured text, their application to structured data remains underexplored. In particular, using LLMs for complex reasoning tasks on knowledge graphs (KGs) remains largely untouched. To address this, we propose KG-GPT, a multi-purpose framework leveraging LLMs for tasks employing KGs. KG-GPT comprises three steps: Sentence Segmentation, Graph Retrieval, and Inference, each aimed at partitioning sentences, retrieving relevant graph components, and deriving logical conclusions, respectively. We evaluate KG-GPT using KG-based fact verification and KGQA benchmarks, with the model showing competitive and robust performance, even outperforming several fully-supervised models. Our work, therefore, marks a significant step in unifying structured and unstructured data processing within the realm of LLMs.


Introduction
The remarkable advancements in large language models (LLMs) have notably caught the eye of scholars conducting research in the field of natural language processing (NLP) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a,b; Anil et al., 2023). In their endeavor to create LLMs that can mirror the reasoning capabilities inherent to humans, past studies have primarily centered their attention on unstructured textual data. This includes, but is not limited to, mathematical word problems (Miao et al., 2020; Cobbe et al., 2021; Patel et al., 2021), CSQA (Talmor et al., 2019), and symbolic manipulation (Wei et al., 2022). While significant strides have been made in this area, the domain of structured data remains largely unexplored.
Structured data, particularly in the form of knowledge graphs (KGs), serves as a reservoir of interconnected factual information and associations, articulated through nodes and edges. The inherent structure of KGs offers a valuable resource that can assist in executing complex reasoning tasks, like multi-hop inference. Even with these advantages, to the best of our knowledge, there is no general framework for performing KG-based tasks (e.g., question answering, fact verification) using auto-regressive LLMs.
To this end, we propose a new general framework, called KG-GPT, that uses LLMs' reasoning capabilities to perform KG-based tasks. KG-GPT is similar to StructGPT (Jiang et al., 2023) in that both reason on structured data using LLMs. However, unlike StructGPT, which identifies paths from a seed entity to the final answer entity within KGs, KG-GPT retrieves the entire sub-graph and then infers the answer. This means KG-GPT can be used not only for KGQA but also for tasks like KG-based fact verification.
KG-GPT consists of three steps: 1) Sentence (Claim / Question) Segmentation, 2) Graph Retrieval, and 3) Inference. During Sentence Segmentation, a sentence is partitioned into discrete sub-sentences, each aligned with a single triple (i.e., [head, relation, tail]). The subsequent step, namely Graph Retrieval, retrieves a potential pool of relations that could bridge the entities identified within the sub-sentences. Then, a candidate pool of evidence graphs (i.e., sub-KGs) is obtained using the retrieved relations and the entity set. In the final step, the obtained graphs are used to derive a logical conclusion, such as validating a given claim or answering a given question.
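To make the pipeline concrete, the sketch below wires the Graph Retrieval and Inference steps together around a stubbed LLM call. The `ToyKG` class, the `SubSentence` container, the `kg_gpt` function, and the stub LLM interface are all our own illustrative assumptions, not the paper's implementation; the output of Sentence Segmentation is taken as given.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SubSentence:
    """One segment of the input sentence, aligned with a single triple."""
    text: str
    entities: List[str]

class ToyKG:
    """A toy triple store standing in for a real KG such as DBpedia."""
    def __init__(self, triples: List[Tuple[str, str, str]]):
        self._triples = triples

    def candidate_relations(self, entities: List[str]) -> List[str]:
        """Relations attached to any of the given entities (schema lookup)."""
        return sorted({r for h, r, t in self._triples
                       if h in entities or t in entities})

    def match(self, entities: List[str], relations: List[str]):
        """Triples whose relation was retrieved and whose head or tail is known."""
        return [(h, r, t) for h, r, t in self._triples
                if r in relations and (h in entities or t in entities)]

def kg_gpt(llm: Callable, sentence: str,
           sub_sentences: List[SubSentence], kg: ToyKG):
    """Stages 2 and 3 wired together; segmentation output is taken as given."""
    evidence = []
    for sub in sub_sentences:                       # Graph Retrieval
        cands = kg.candidate_relations(sub.entities)
        top_k = llm("retrieve", sub.text, cands)
        evidence += kg.match(sub.entities, top_k)
    return llm("infer", sentence, evidence)         # Inference
```

In practice each `llm(...)` call would be an in-context-learning prompt to ChatGPT; here it is only a placeholder showing where the model is consulted.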
To evaluate KG-GPT, we employ KG-based fact verification and KGQA benchmarks, both demanding complex reasoning that utilizes the structured knowledge of KGs. In KG-based fact verification, we use FACTKG (Kim et al., 2023), which includes various graph reasoning patterns, and KG-GPT shows competitive performance compared to other fully-supervised models, even outperforming some. In KGQA, we use MetaQA (Zhang et al., 2018), a QA dataset composed of 1-hop, 2-hop, and 3-hop inference tasks. KG-GPT shows performance comparable to fully-supervised models. Notably, the performance does not significantly decline as the number of hops increases, demonstrating its robustness.

Method
KG-GPT is composed of three stages: Sentence Segmentation, Graph Retrieval, and Inference, as described in Fig. 1.
We assume we are given a graph G (a knowledge graph consisting of entities E and relations R), a sentence S (a claim or question), and all entities involved in S, E_S ⊂ E. In order to derive a logical conclusion, we need an accurate evidence graph G_E ⊂ G, which we obtain in two stages, Sentence Segmentation and Graph Retrieval. Furthermore, all the aforementioned steps are executed employing the in-context learning methodology to maximize the LLM's reasoning ability. The prompts used for each stage are in Appendix A.

Sentence Segmentation
Many KG-based tasks require multi-hop reasoning.
To address this, we utilize a divide-and-conquer approach. By breaking down a sentence into sub-sentences that each correspond to a single relation, identifying relations in each sub-sentence becomes easier than finding n-hop relations connected to an entity from the original sentence all at once. We assume S can be broken down into sub-sentences S_1, S_2, ..., S_n, where S_i consists of a set of entities E_i ⊂ E and a relation r_i ∈ R. Each e_i^(j) ∈ E_i can be a concrete entity (e.g., William Anders in Fig. 1-(1)) or a type (e.g., artificial satellite in Fig. 1-(1)). r_i can be mapped to one or more items in R, as there can be multiple relations with similar semantics (e.g., birthPlace, placeOfBirth).
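The segmentation prompt asks for numbered sub-sentences followed by a `##`-marked entity set, which suggests a simple parser on the framework side. The sketch below is a minimal, assumed implementation of such a parser; the exact output syntax of the real prompt may differ.

```python
import re
from typing import List, Tuple

def parse_segmentation(output: str) -> List[Tuple[str, List[str]]]:
    """Parse LLM segmentation output of the assumed form
    '#1. <sub-sentence> ## {Entity_A, Entity_B}'
    into (sub-sentence, entity list) pairs, skipping malformed lines."""
    parsed = []
    for line in output.strip().splitlines():
        m = re.match(r"#\d+\.\s*(.*?)\s*##\s*\{?(.*?)\}?\s*$", line)
        if m:
            text, ents = m.group(1), m.group(2)
            entities = [e.strip() for e in ents.split(",") if e.strip()]
            parsed.append((text.rstrip(","), entities))
    return parsed
```

Each parsed pair then feeds one Graph Retrieval call, so a sentence requiring n-hop reasoning yields n independent, easier sub-problems.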

Graph Retrieval
To effectively validate a claim or answer a question, it is crucial to obtain an evidence graph (i.e., a sub-KG) that facilitates logical conclusions. In this stage, we first aim to retrieve the corresponding relations for each sub-sentence S_i in order to extract G_E.
For each S_i, we use the LLM to map r_i to one or more items in R as accurately as possible. To do so, we first define R_i ⊂ R, the set of relations connected to all e_i^(j) ∈ E_i according to the schema of G (i.e., the relation candidates in Fig. 1-(2)). This process considers both the relations connected to a specific entity and the relations associated with the entity's type in G. We further elaborate on the process in Appendix B. Then, we feed S_i and R_i to the LLM to retrieve the set of final top-K relations R_{i,K}. In detail, the relations in R_i are linearized (e.g., [location, birthYear, ..., birthDate]) and combined with the corresponding sub-sentence S_i to form the prompt, and the LLM generates R_{i,K} as output. In the final graph retrieval step, we obtain G_E, made up of all triples whose relations come from R_{i,K} and whose entities come from E_i, across all S_i.
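The deterministic parts of this stage, linearizing the candidate relations for the prompt and assembling G_E from the retrieved relations and entity sets, can be sketched as follows. The function names and the exact triple filter are our assumptions.

```python
from typing import Iterable, List, Tuple

def linearize_relations(relations: List[str]) -> str:
    """Linearize relation candidates for the prompt, e.g. '[location, birthYear]'."""
    return "[" + ", ".join(relations) + "]"

def evidence_graph(kg_triples: Iterable[Tuple[str, str, str]],
                   entity_sets: List[List[str]],
                   retrieved: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Assemble the evidence sub-graph G_E: keep every triple whose relation
    was retrieved for some sub-sentence S_i and whose head or tail appears in
    that sub-sentence's entity set E_i, deduplicating across sub-sentences."""
    g_e: List[Tuple[str, str, str]] = []
    for ents, rels in zip(entity_sets, retrieved):
        for h, r, t in kg_triples:
            if r in rels and (h in ents or t in ents) and (h, r, t) not in g_e:
                g_e.append((h, r, t))
    return g_e
```

Only the relation selection itself is delegated to the LLM; everything else above is plain set filtering over the KG.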

Inference
Then, we feed S and G_E to the LLM to derive a logical conclusion. To represent G_E in the prompt, we linearize the triples associated with G_E (i.e., [[head_1, rel_1, tail_1], ..., [head_m, rel_m, tail_m]]) and then concatenate these linearized triples with the sentence S. In fact verification, the determination of whether S is supported or refuted is contingent upon G_E. In question answering, the LLM identifies an entity in G_E as the most probable answer to S.
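The triple linearization and prompt assembly described above might look like the following sketch. The prompt wording and function names are illustrative assumptions, not the exact prompts given in Appendix A.

```python
from typing import List, Tuple

def linearize_triples(triples: List[Tuple[str, str, str]]) -> str:
    """Render G_E as '[[head_1, rel_1, tail_1], ..., [head_m, rel_m, tail_m]]'."""
    return "[" + ", ".join(f"[{h}, {r}, {t}]" for h, r, t in triples) + "]"

def inference_prompt(sentence: str,
                     triples: List[Tuple[str, str, str]],
                     task: str = "verification") -> str:
    """Concatenate the linearized evidence graph with the sentence S."""
    evidence = linearize_triples(triples)
    question = ("Is the claim supported or refuted?" if task == "verification"
                else "Which entity answers the question?")
    return f"Evidence: {evidence}\nSentence: {sentence}\n{question}"
```

The same evidence representation serves both tasks; only the final instruction changes between verification and question answering.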

Experiments
We evaluate our framework on two tasks that require KG grounding: fact verification and question answering. A detailed description of the experimental settings can be found in Appendix C.

FACTKG
FACTKG (Kim et al., 2023) serves as a practical and challenging dataset meticulously constructed for fact verification against a knowledge graph. It encompasses 108K claims that can be verified via DBpedia (Lehmann et al., 2015), one of the available comprehensive knowledge graphs. These claims are categorized as either Supported or Refuted. FACTKG embodies five diverse types of reasoning that represent the intrinsic characteristics of the KG: One-hop, Conjunction, Existence, Multi-hop, and Negation. To further enhance its practical use, FACTKG integrates claims in both colloquial and written styles. Examples of claims from FACTKG can be found in Appendix D.

MetaQA
MetaQA (Zhang et al., 2018) is a carefully curated dataset intended to facilitate the study of question answering that leverages KG-based approaches in the movie domain. The dataset encompasses over 400K questions, including instances of 1-hop, 2-hop, and 3-hop reasoning. Additionally, it covers a diverse range of question styles. Examples of questions from MetaQA can be found in Appendix D.

Baselines
For evaluation on FACTKG, we use the same baselines as in Kim et al. (2023). These baselines are divided into two distinct categories: Claim Only and With Evidence. In the Claim Only setting, the models are provided only with the claim as input and predict the label. For this setting, in addition to the existing baselines, we implement a 12-shot ChatGPT (OpenAI, 2023b) baseline. In the With Evidence setting, models consist of an evidence graph retriever and a claim verification model. We employ the KG version of GEAR (Zhou et al., 2019) as a fully supervised model.

FACTKG
We evaluated the models' ability to predict labels (i.e., Supported or Refuted) and present the accuracy scores in Table 1. KG-GPT outperforms the Claim Only models BERT, BlueBERT, and Flan-T5 with absolute performance gains of 7.48%, 12.75%, and 9.98%, respectively. It also outperforms 12-shot ChatGPT by 4.20%. These figures emphasize the effectiveness of our framework in extracting the necessary evidence for claim verification, highlighting the positive impact of the sentence segmentation and graph retrieval stages. Qualitative results, including the graphs retrieved by KG-GPT, are in Appendix E.1. Nonetheless, when compared to GEAR, a fully supervised model built upon KGs, KG-GPT exhibits certain limitations. KG-GPT achieves an accuracy of 72.68%, behind GEAR's 77.65%. This performance gap illustrates the obstacles encountered by KG-GPT in a few-shot scenario, namely the difficulty of amassing a sufficient volume of information from the restricted data available. Hence, despite the notable progress achieved with KG-GPT, there is clear room for improvement to equal or surpass KG-specific supervised models like GEAR.

Table 2: The performance of the models on MetaQA (Hits@1). The best results for each task and those of 12-shot KG-GPT are in bold.

MetaQA
The findings on MetaQA are presented in Table 2. The performance of KG-GPT is impressive, scoring 96.3%, 94.4%, and 94.0% on the 1-hop, 2-hop, and 3-hop tasks, respectively. This demonstrates its strong ability to generalize from a limited number of examples, a critical trait when handling real-world applications with varying degrees of complexity. Interestingly, the performance of KG-GPT closely matches that of fully-supervised models. In particular, it surpasses KV-Mem by margins of 0.1%, 11.7%, and 45.1% across the three tasks, respectively, signifying its superior performance. While the overall performance of KG-GPT is similar to that of GraftNet, a noteworthy difference is pronounced in the 3-hop task, wherein KG-GPT outperforms GraftNet by 16.3%. Qualitative results, including the graphs retrieved by KG-GPT, are in Appendix E.2.

Error Analysis
In both FACTKG and MetaQA, there are no ground truth graphs containing the seed entities. This absence makes a quantitative step-by-step analysis challenging. Therefore, we carried out an error analysis, extracting 100 incorrect samples from each dataset: FACTKG, MetaQA-1hop, MetaQA-2hop, and MetaQA-3hop. Table 3 shows the number of errors observed at each step. Notably, errors during the graph retrieval phase are the fewest among the three steps. This suggests that once sentences are correctly segmented, identifying the relations within them becomes relatively easy. Furthermore, a comparative analysis across MetaQA-1hop, MetaQA-2hop, and MetaQA-3hop indicates that as the number of hops increases, so does the diversity of the questions. This heightened diversity in turn escalates the errors in Sentence Segmentation.

Number of In-context Examples
The results for the 12-shot, 8-shot, and 4-shot settings on the FACTKG and MetaQA datasets are reported in Table 1 and Table 2, respectively. Though we expected performance to improve with the number of shots on both FACTKG and MetaQA, this was not uniformly observed across all scenarios. Notably, MetaQA demonstrated superior performance, exceeding 90%, in both the 1-hop and 2-hop scenarios, even with a minimal set of four examples. In contrast, in both the FACTKG and MetaQA 3-hop scenarios, the performance of the 4-shot setting was similar to that of the baselines that did not utilize graph evidence. This similarity suggests that LLMs may struggle to interpret complex data features when equipped with only four shots. Thus, the findings highlight the importance of formulating in-context examples according to the complexity of the task.

Top-K Relation Retrieval
Table 11 shows the performance on FACTKG according to the value of k. Performance did not vary significantly with k. Table 12 illustrates the average number of triples retrieved for both supported and refuted claims, depending on k. Despite the increase in the number of triples as k grows, accuracy is unaffected. This suggests that the additional triples are not significantly influential.
In MetaQA, the performance and the average number of retrieved triples are depicted in Table 13 and Table 14, respectively. Unlike in the FACTKG experiment, as the value of k rises in MetaQA, more significant triples appear to be retrieved, leading to improved performance.

Conclusion
We present KG-GPT, a versatile framework that utilizes LLMs for tasks that use KGs. KG-GPT is divided into three stages: Sentence Segmentation, Graph Retrieval, and Inference, each designed for breaking down sentences, sourcing related graph elements, and reaching reasoned outcomes, respectively. We assess KG-GPT's efficacy on KG-based fact verification and KGQA benchmarks, and the model demonstrates consistent, impressive results, even surpassing a number of fully-supervised models. Consequently, our research signifies a substantial advancement in combining structured and unstructured data management in the context of LLMs.

Limitations
Our study has two key limitations. First, KG-GPT is highly dependent on in-context learning, and its performance varies significantly with the number of examples provided. The framework struggles particularly with complex tasks when there are insufficient or low-quality examples. Second, despite its impressive performance on fact verification and question answering tasks, KG-GPT still lags behind fully supervised KG-specific models. The gap in performance highlights the challenges faced by KG-GPT in a few-shot learning scenario due to limited data. Future research should focus on optimizing language models that leverage KGs to overcome these limitations.

A Prompts
The prompts for Sentence Segmentation, Graph Retrieval, and Inference can be found in Table 4, Table 5 and Table 6, respectively.

B Relation Candidates Extraction Algorithm

B.1 FACTKG
In FACTKG, we develop a new KG called TypeDBpedia. This graph comprises the types found in DBpedia and connects them through relations, thereby enhancing the usability of KG content. We describe the detailed process of constructing R_i using DBpedia and TypeDBpedia in Algorithm 1.
We denote the entities as E_1 and E_2 because a sub-sentence always includes two entities in FACTKG. Relations(e, DBpedia) represents the set of relations connected to e in DBpedia. Similarly, Relations(T, TypeDBpedia) represents the set of relations connected to T in TypeDBpedia.
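Under these definitions, Algorithm 1's candidate extraction amounts to a union over entity-level relations from DBpedia and type-level relations from TypeDBpedia. The dictionary-based KG representation below is our own simplification of the real lookup.

```python
from typing import Dict, List, Set

def relation_candidates(e1: str, e2: str,
                        dbpedia_rel: Dict[str, Set[str]],
                        type_rel: Dict[str, Set[str]],
                        entity_types: Dict[str, List[str]]) -> Set[str]:
    """Sketch of Algorithm 1 (names assumed): union the relations attached to
    each concrete entity in DBpedia, i.e. Relations(e, DBpedia), with the
    relations attached to each of the entity's types in TypeDBpedia, i.e.
    Relations(T, TypeDBpedia)."""
    candidates: Set[str] = set()
    for e in (e1, e2):
        candidates |= dbpedia_rel.get(e, set())    # relations of the entity
        for t in entity_types.get(e, []):          # relations of its types
            candidates |= type_rel.get(t, set())
    return candidates
```

Entities that are themselves types (such as "artificial satellite" in Fig. 1) would simply be resolved through the type-level lookup.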

Algorithm 1: Extract Relation Candidates

B.2 MetaQA

For the n-hop task in MetaQA, R_i is constructed from the relations within n hops of the seed entity.

C Experimental Settings
We utilize ChatGPT (OpenAI, 2023b) across all tasks, and to acquire more consistent responses, we carry out inference with the temperature and top_p parameters set to 0.2 and 0.1, respectively. For each stage of KG-GPT, 12 training samples were turned into in-context examples and added to the prompt. In FACTKG, there are over 500 relations (|R| > 500), so we set k = 5 for top-K relation retrieval. Conversely, in MetaQA, there are only 9 relations (|R| = 9), so we set k = 3.
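These settings can be captured in a small helper as below. The threshold used to switch between k = 5 and k = 3 is our own assumption for illustration; the paper fixes k per dataset rather than by a general rule.

```python
from typing import Dict, Tuple

def decoding_and_k(num_relations: int) -> Tuple[Dict[str, float], int]:
    """Mirror the reported settings: temperature 0.2 and top_p 0.1 for more
    consistent responses, with k = 5 for a large relation vocabulary
    (FACTKG, |R| > 500) and k = 3 for a small one (MetaQA, |R| = 9).
    The |R| > 500 cutoff is an assumed generalization."""
    params = {"temperature": 0.2, "top_p": 0.1}
    k = 5 if num_relations > 500 else 3
    return params, k
```

The returned dictionary would be passed as sampling parameters to the chat-completion call.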

D Data Examples
Examples of data from FACTKG and MetaQA can be found in Tables 7 and 8, respectively.

E Qualitative Results

E.1 FACTKG
Table 9 includes the graphs retrieved by KG-GPT, along with the prediction results, for five different claims.

E.2 MetaQA
Table 10 includes the graphs retrieved by KG-GPT, along with the prediction results, for nine different questions.

F Top-K Relation Retrieval

F.1 FACTKG
The performance and the average number of retrieved triples are depicted in Table 11 and Table 12, respectively.

F.2 MetaQA
The performance and the average number of retrieved triples are depicted in Table 13 and Table 14, respectively.
Table 4: Sentence Segmentation Prompt.

"Please divide the given sentence into several sentences, each of which can be represented by one triplet. The generated sentences should be numbered and formatted as follows: #(number). (sentence), (entity set). The entity set for each sentence should contain no more than two entities, with each entity being used only once in all statements. The '##' symbol should be used to indicate an entity set. In the generated sentences, there cannot be more than two entities in the entity set (i.e., the number of ## must not be larger than two)."

[Fragment of a relation-candidate list from the Relation Retrieval prompt: 'clubs', 'parent', 'spouse', 'birthPlace', 'deathYear', 'leaderName', 'awards', 'award', 'vicepresident', 'vicePresident']

Table 6: Inference Prompt.

"You should verify the claim based on the evidence set. Each evidence is in the form of [head, relation, tail] and it means "head's relation is tail.""

Examples
Verify the claim based on the evidence set. (True means that everything contained in the claim is supported by the evidence.) Please note that the unit is not important (e.g., "98400" is the same as 98.4kg). Choose one of {True, False}, and give me the one-sentence evidence.

Figure 1: An overview of KG-GPT. The framework comprises three distinct phases: Sentence Segmentation, Graph Retrieval, and Inference. The given example comes from FACTKG. It involves a 2-hop inference from 'William_Anders' to 'Frank_Borman', requiring verification through an evidence graph consisting of three triples. Both 'William_Anders' and 'Frank_Borman' serve as internal nodes in DBpedia (Lehmann et al., 2015), while "AFIT, M.S. 1962" acts as a leaf node. Moreover, artificial satellite represents the Type information absent from the provided entity set.

Table 5: Relation Retrieval Prompt. The prompt is used to retrieve the relations for each sub-sentence during graph retrieval.