Prompting Large Language Models with Chain-of-Thought for Few-Shot Knowledge Base Question Generation

The task of Question Generation over Knowledge Bases (KBQG) aims to convert a logical form into a natural language question. Because large-scale question annotation is expensive, KBQG methods for low-resource scenarios urgently need to be developed. However, current methods heavily rely on annotated data for fine-tuning, which is not well-suited for few-shot question generation. The emergence of Large Language Models (LLMs) has shown their impressive generalization ability in few-shot tasks. Inspired by Chain-of-Thought (CoT) prompting, an in-context learning strategy for reasoning, we formulate KBQG as a reasoning problem, where the generation of a complete question is split into a series of sub-question generation steps. Our proposed prompting method, KQG-CoT, first retrieves supportive logical forms from an unlabeled data pool, taking into account the characteristics of the logical form. It then writes a prompt that makes explicit the reasoning chain for generating complicated questions based on the selected demonstrations. To further ensure prompt quality, we extend KQG-CoT into KQG-CoT+ by sorting the logical forms by their complexity. We conduct extensive experiments over three public KBQG datasets. The results demonstrate that our prompting method consistently outperforms other prompting baselines on the evaluated datasets. Remarkably, our KQG-CoT+ method surpasses existing few-shot SoTA results on the PathQuestions dataset by 18.25, 10.72, and 10.18 absolute points on BLEU-4, METEOR, and ROUGE-L, respectively.


Introduction
Question generation requires a system to produce natural language questions based on a given context. KBQG (Guo et al., 2022) is one of the imperative question generation tasks, where the given context, derived from Knowledge Bases (KBs), is a logical form. KBQG has attracted increasing interest from both industry and academia due to its potential for data augmentation in QA systems (Xiong et al., 2022; Chen et al., 2023) and its ability to assist dialogue systems in creating coherent questions (Lee et al., 2018).
Existing studies (Kumar et al., 2019; Ke et al., 2021; Fei et al., 2022; Guo et al., 2022; Chen et al., 2023) on KBQG have predominantly adopted neural network-based approaches and demonstrated impressive performance by fine-tuning on extensive training datasets. However, as the collection of KBQG data is labor-intensive, researchers have started paying attention to few-shot KBQG (Xiong et al., 2022), which poses great challenges for suppliers with limited resources: 1) A great deal of annotated data is required for existing fine-tuned models to generalize well over different logical forms. Due to low-resource availability, training conventional models by fine-tuning on the full data becomes unrealistic. 2) A logical form is composed of entities, relations, and query grammar. Having logical forms with various combinations of these basic components is crucial to uphold the model's capability for compositional generalization. The lack of data therefore creates a compositional challenge for KBQG (Gu et al., 2021). 3) Certain logical forms can become complex when operations such as aggregation, superlatives, and comparisons are involved. Representing these logical forms presents additional challenges. Moreover, developing a KBQG method that incorporates diverse and elaborate expressions becomes particularly difficult in such low-resource scenarios (Xiong et al., 2022; Guo et al., 2022).
Recently, LLMs such as GPT-3 and Codex (Gao et al., 2022; Suzgun et al., 2022; Wei et al., 2022; Wang et al., 2023a) have proven their strong generalizability with CoT on a wide range of few-shot and zero-shot tasks, including text interpretation, computer vision, planning, and reasoning. Meanwhile, a line of work (Kasner et al., 2022; Moiseev et al., 2022; Andrus et al., 2022; Trajanoska et al., 2023) validates that LLMs can accurately capture the semantics of relations between values in structured data, enabling them to transform structured inputs into narrative text. These studies inspire us to explore few-shot KBQG by prompting LLMs with CoT.
However, how to apply LLMs to KBQG with CoT is still unclear. On one hand, KBQG differs from tasks like code generation or question answering, as its input incorporates KB-specific items rather than self-contained narratives. Therefore, formatting the input in an easily understandable manner while considering the KB schema is crucial. On the other hand, the challenge lies in designing effective CoT prompts (Wei et al., 2022) that can enhance the performance of LLMs on few-shot KBQG.
In this work, we propose the KQG-CoT framework, which is the first attempt at training-free few-shot KBQG with LLMs. As shown in Figure 1, our framework consists of two main steps: selecting supportive logical forms from an unlabeled data pool and constructing the prompt. To acquire coherent logical forms, we employ a clustering technique to carefully choose multiple representative logical forms, considering both their syntactic and semantic characteristics. To construct the prompt, inspired by the principle of CoT (Wei et al., 2022), we take the selected logical forms as exemplars and write rationales that split the generation of a complete question into multiple steps. We concatenate these rationales with the queried logical form to form a prompt, which guides an LLM to output a reasoning process that generates a complex question aligned with the logical form. We further improve KQG-CoT to KQG-CoT+ by sorting the supportive logical forms by complexity.
Whereas previous methods rely heavily on training instances to fine-tune a KBQG model, KQG-CoT does not need numerous logical form–question pairs for training. We test the performance of our prompting methods under the few-shot setting on three public datasets, namely WebQuestions (Kumar et al., 2019), PathQuestions (Zhou et al., 2018), and GrailQA (Gu et al., 2021). We conduct a comprehensive comparison with a range of commonly used CoT baselines, including Auto-CoT (Zhang et al., 2023c), Active-CoT (Diao et al., 2023), and Random-CoT (Brown et al., 2020). The experimental results show that we outperform all of them by an observable margin. Besides, we also compare with a set of SoTA systems trained with full or few data. Our few-shot method achieves results competitive with fully-trained methods. Remarkably, our few-shot method surpasses existing few-shot SoTA results on the PathQuestions dataset by 18.25, 10.72, and 10.18 absolute points on BLEU-4, METEOR, and ROUGE-L, respectively.
KQG-CoT provides a simple but effective solution to the few-shot KBQG problem, and we expect it to serve as an important baseline for future investigation of KBQG under low-resource scenarios.

Related Work
Knowledge Base Question Generation. Early approaches to KBQG were template-based. Berant et al. (2013) and Talmor and Berant (2018a) utilized search engines and manual annotation to construct natural language questions based on logical forms. However, template-based methods rely on manual intervention, which is hard to scale up. With the advancement of deep neural networks, neural network-based methods have emerged as a prominent and widely adopted approach. Kumar et al. (2019) and Chen et al. (2023) proposed end-to-end models based on Transformer and Graph2seq architectures, which are capable of generating complex, multi-hop questions from a subgraph. Follow-up studies (Fei et al., 2022; Guo et al., 2022) developed more sophisticated models for KBQG, which ensure the relevance between the generated questions and subgraphs. Xiong et al. (2022) proposed a method for low-resource KBQG, where an auto-prompter is developed to paraphrase a logical form into a description, so that a pre-trained language model can be fine-tuned on the augmented data. Our work differs in that our method focuses on solving the few-shot KBQG challenge with frozen LLMs.
Few-shot Learning for Text Generation. In recent years, significant progress has been made in few-shot learning for text generation. One line of work develops meta-learning frameworks for text generation (Mi et al., 2019; Madotto et al., 2019; Zeng et al., 2021; Hospedales et al., 2022), which aim to acquire an optimal initialization that enables accurate and rapid adaptation to a new task, even when limited data is available. Another line of work proposes augmentation algorithms to synthesize data for training (Song et al., 2019; Zhao et al., 2022), so that conventional text generation models can be applied to the augmented data. Most recently, LLMs have been leveraged to solve few-shot text generation tasks such as text summarization (Yang et al., 2023; Zhang et al., 2023b; Liu et al., 2023), machine translation (Wang et al., 2023b; Hendy et al., 2023), and dialogue generation (Zhang et al., 2023a; Valvoda et al., 2022; Kang et al., 2022). There is no existing study applying LLMs to few-shot KBQG.
In-Context Learning with LLMs. Without gradient updates, In-Context Learning (ICL) effectively tackles a wide range of NLP tasks by incorporating a small number of prompted examples as part of the input (Ruis et al., 2023) to help LLMs understand the task. Multiple studies (Su et al., 2022; Rubin et al., 2022) explored selecting examples that are similar to the query during prompt construction. Recent research (Lu et al., 2022a; Liu et al., 2022; Diao et al., 2023; Wang et al., 2023c) highlights that the order of these examples in the prompt has a substantial influence. CoT is a prompting strategy that decomposes complex tasks into sub-tasks, helping the model derive the correct answer progressively (Wei et al., 2022; Zhou et al., 2023). It has been widely used in mathematical word problem solving, common-sense reasoning, and symbolic reasoning. Our work incorporates the CoT strategy into KBQG, where an iterative process enables LLMs to ultimately obtain a complex question aligned with the logical form.

Problem Formulation
A KB consists of a set of triples. A logical form is a structural expression of a subgraph in the KB, which may contain complex operations (e.g., aggregation, comparatives, and superlatives) and can be executed against the KB. The KBQG task requires a system to generate a natural language question whose semantics are consistent with a given logical form and the corresponding KB.

Method Overview
Recently, LLMs have shown impressive in-context few-shot learning capabilities. Instead of fine-tuning a pre-trained model to adapt it to a downstream task, we can simply apply it to a new task with a few examples as the prompt during inference (Yang et al., 2022; Li et al., 2023). For the KBQG task, we adopt a two-stage method to design CoT prompts, which effectively enables the LLM to comprehend complex logical forms and generate questions. Concretely, the first stage, Supportive Logical Forms Selection, focuses on identifying supportive examples that represent various syntax patterns of logical forms. To accomplish this, we encode the structure of logical forms, perform clustering, and employ sampling techniques to select the top-k supportive logical forms. Once these supportive examples are selected, we leverage LLMs with CoT prompts to generate natural language questions. This leads to the second stage, Prompt Construction, which produces sub-questions as rationales. Through this process, we ultimately formulate a complex question that adequately captures the semantics of the logical form. A schematic diagram of our method is displayed in Figure 2.

Supportive Logical Forms Selection
Zhang et al. (2023c) have shown that when constructing demonstrations, we need to mitigate the effect of few-shot CoT errors by diversifying the design of demonstrations. In KBQG, supportive logical forms are those that cover diverse logical rules, so as to offer more syntax information for LLMs to generate questions. Unlike narrative inputs, a logical form is a combination of program structures and schema items (i.e., entities and relations). Therefore, it is essential to take both aspects into consideration when selecting supportive logical forms. In our approach, we utilize Structure Encoding and Clustering, followed by a Logical Form Sampling process, to select supportive logical forms.

Structure Encoding and Clustering. To ensure the logical forms can be drafted for unseen questions, we extract their structures by converting the schema items into symbolic variables. Specifically, we keep the grammar tokens in the logical form unchanged. Then, we replace each relation with the symbol "r" and each entity with "e". This structure is also known as an abstract query graph (Chen et al., 2021), which reflects the topology and the component classes of logical forms. For instance, the raw logical form "(AND medicine.routed_drug ...)" becomes the structure "(AND r (JOIN r e))" after conversion.
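As an illustration of this conversion, a minimal sketch is given below. It assumes relations can be recognized as dotted schema paths and all remaining non-grammar tokens are entities; the grammar list and the example input are illustrative simplifications, not the paper's actual preprocessing code.

```python
import re

# Partial, illustrative list of s-expression grammar tokens (assumption, not exhaustive).
GRAMMAR = {"AND", "JOIN", "R", "COUNT", "ARGMAX", "ARGMIN", "LT", "LE", "GT", "GE"}

def to_abstract_query_graph(logical_form: str) -> str:
    """Replace relations with 'r' and entities with 'e', keeping the grammar unchanged."""
    tokens = re.findall(r"\(|\)|[^\s()]+", logical_form)
    abstract = []
    for tok in tokens:
        if tok in ("(", ")") or tok.upper() in GRAMMAR:
            abstract.append(tok)      # keep parentheses and grammar tokens
        elif "." in tok:
            abstract.append("r")      # dotted schema paths treated as relations
        else:
            abstract.append("e")      # everything else treated as an entity
    # Note: Freebase MIDs (e.g., "m.0abc") would need extra handling in practice.
    return " ".join(abstract).replace("( ", "(").replace(" )", ")")

# Illustrative example:
# to_abstract_query_graph("(AND sports.sport (JOIN sports.sport.team_coaches John_Russo))")
# -> "(AND r (JOIN r e))"
```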
The extracted structure filters out the semantic content of the logical form. We encode this structure representation into a fixed-length embedding. In detail, we view the structure as a sequence of tokens and encode it with Sentence-Transformers (Reimers and Gurevych, 2019), an advanced model for text embedding whose encoded vectors are well-suited for calculating similarity between sentences. We extract the final hidden state as the vectorized representation of the sequence. After that, we utilize the K-means (Hartigan and Wong, 1979) clustering algorithm to group the encoded structures into k clusters based on their syntactic similarity.
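A sketch of this step using the sentence-transformers and scikit-learn libraries is shown below; the pooling strategy and clustering hyperparameters are illustrative assumptions rather than the paper's exact settings.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_structures(structures: list[str], k: int = 12):
    """Embed abstract query graphs and group them into k clusters by syntactic similarity."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(structures)   # one fixed-length vector per structure
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(embeddings)
    clusters = [[] for _ in range(k)]
    for structure, label in zip(structures, labels):
        clusters[label].append(structure)
    return clusters, encoder
```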
Logical Form Sampling. Each cluster contains a group of logical forms with a similar structure. We randomly pick one structure from each group and obtain k representative structures. As each structure may correspond to multiple logical forms, we further identify k logical forms with distinct semantics derived from the k selected structures. To this end, we iteratively sample logical forms to maximize semantic diversity. Specifically, for the first structure, we randomly pick one of its candidate logical forms. We then move on to the next structure and greedily pick the candidate with the least semantic similarity to the already selected logical forms, where the similarity is measured over the encodings of the original logical forms. We repeat the process until all k structures have been covered, as shown in Figure 2.
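The greedy sampling can be sketched as follows. It assumes a mapping lf_by_structure from each abstract structure to its candidate logical forms, uses cosine similarity over embeddings of the original logical forms as the semantic similarity measure, and greedily picks the candidate least similar to anything already selected.

```python
import numpy as np

def sample_supportive_forms(clusters, lf_by_structure, encoder, seed: int = 0):
    """Pick one structure per cluster, then one logical form per structure,
    greedily maximizing semantic diversity among the selected logical forms."""
    rng = np.random.default_rng(seed)
    selected, selected_emb = [], []
    for group in clusters:
        structure = str(rng.choice(group))            # one representative structure per cluster
        candidates = lf_by_structure[structure]       # logical forms sharing that structure
        cand_emb = np.asarray(encoder.encode(candidates))
        if not selected:
            idx = int(rng.integers(len(candidates)))  # the first logical form is picked randomly
        else:
            sel = np.stack(selected_emb)
            sims = cand_emb @ sel.T / (
                np.linalg.norm(cand_emb, axis=1, keepdims=True) * np.linalg.norm(sel, axis=1)
            )
            idx = int(np.argmin(sims.max(axis=1)))    # least similar to any already-selected form
        selected.append(candidates[idx])
        selected_emb.append(cand_emb[idx])
    return selected
```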
To help the LLMs fully understand the logical forms, we substitute the entities in the original logical forms with their surface names in the KB. In this way, we obtain k supportive logical forms.

Prompt Construction
Some logical forms have complicated semantics and may even contain nested syntactic structures. Following the CoT method, we construct a reasoning-chain prompt based on the supportive logical forms retrieved above. For each example, we write a reasoning chain over the logical form to elicit LLMs to generate questions from simple to complicated. To this end, we hold two criteria when constructing reasoning chains: (i) The templates should break up the generation of a complicated question into a step-by-step process.
(ii) The templates should clearly identify the subcomponent of the logical form that LLMs need to focus on at each step.
Therefore, we first break down a logical form in a nested manner, where each follow-up logical form includes the preceding ones. Specifically, the first step usually generates a simple question querying a one-hop relation from the topic entity. The second step usually generates a question querying a two-hop relation chain that involves the above one-hop relation. As we can see from Figure 2, the first step of the prompt parses the entire logical form into the one-hop relation subgraph1 "(AND sports.sport.team_coaches John Russo)", which leads to a simple subquestion1 "sport team coach john russo". The second step appends the parsed logical form of the previous step as a component and generates the question "Which sport does john russo coach?" based on subgraph2 and subquestion1. We continuously expand the logical form in this way until a complete question is formed. This step-by-step process ensures that the generated question is semantically coherent and grammatically accurate.
During inference, we concatenate all the demonstrations and the queried logical form as the final prompt. Based on the example in Figure 2, the prompt includes "Input: (AND ... Input: (JOIN ... Input: (COUNT ... S.A.". After receiving the prompt, the LLM outputs a prediction that clarifies the intermediate generation steps of subquestion1, subquestion2, and subquestion3. The last subquestion is our final predicted question, which is "What is the number of aircraft manufacturer in the legal structure of s.a.?".
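A minimal sketch of this inference step is shown below; it assumes the demonstrations follow the "Input:" / "Subquestion:" layout illustrated in Figure 2 and the appendix, and it simply reads off the last generated subquestion as the prediction.

```python
def build_prompt(instruction: str, demonstrations: list[str], query_logical_form: str) -> str:
    """Concatenate the instruction, the k worked demonstrations, and the queried logical form."""
    return "\n\n".join([instruction, *demonstrations, f"Input: {query_logical_form}"])

def extract_final_subquestion(completion: str) -> str:
    """The LLM emits subquestions from simple to complex; the last one is the final prediction."""
    subquestions = [
        line.split(":", 1)[1].strip()
        for line in completion.splitlines()
        if line.lower().startswith("subquestion")
    ]
    return subquestions[-1] if subquestions else completion.strip()
```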

Experiment
In this section, we first introduce the KBQG datasets used to evaluate the performance of our proposed method and the comparable baseline methods. Next, we present the implementation details and demonstrate the experimental results.

Data and Metrics
We evaluate our prompting method on the following three public datasets. WebQuestions (WQ) (Kumar et al., 2019) is a KBQG dataset combining instances from WebQuestionsSP (Serban et al., 2016) and ComplexWebQuestions (Talmor and Berant, 2018b). It provides questions, answers, and annotated subgraphs, and is commonly evaluated in existing work (Guo et al., 2022). PathQuestions (PQ) (Zhou et al., 2018) is a commonly used KBQG dataset constructed from a KBQA dataset. It contains questions inquiring about a chain of relations, wherein the path between the topic entities and answer entities is 2-hop or 3-hop. GrailQA (GQ) (Gu et al., 2021) is a large-scale KBQA dataset built on Freebase, which covers 86 domains. It contains complex questions which require counting, ranking, and even superlative inquiry. Each question is associated with an s-expression, which can be viewed as a logical form. We collect the annotated logical forms from the training set as the data pool and leave the original questions untouched. The questions in the validation or test sets are sampled to evaluate our method. Statistics of the evaluated datasets are shown in Table 1.
Following previous KBQG studies, we rely on a set of well-established metrics for KBQG evaluation: BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004). BLEU-4 and ROUGE-L can be viewed as precision and recall for text generation tasks, respectively. METEOR is a comprehensive metric that goes beyond exact matches, also accounting for partial matches and variations in word order. We denote them as B, M, and R, respectively.
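For reference, these metrics can be computed with standard libraries as sketched below (nltk for BLEU-4 and METEOR, the rouge-score package for ROUGE-L); the paper's exact evaluation scripts, tokenization, and smoothing may differ.

```python
# pip install nltk rouge-score ; METEOR additionally requires: nltk.download("wordnet")
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

def evaluate(prediction: str, reference: str) -> dict:
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    bleu4 = sentence_bleu(
        [ref_tokens], pred_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    meteor = meteor_score([ref_tokens], pred_tokens)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
    return {"BLEU-4": bleu4, "METEOR": meteor, "ROUGE-L": rouge_l}
```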

Comparable Methods
We denote our prompting method as KQG-CoT. Previous studies (Lu et al., 2022b) have shown that the order of the exemplars significantly affects prompting results, so we implement an improved version that sorts the demonstrations from short to long after sampling. We denote this method as KQG-CoT+.
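In a minimal sketch, and under the assumption that the length of a logical form is a reasonable proxy for its complexity, the KQG-CoT+ reordering amounts to a single sort over the sampled exemplars (Appendix A.4 compares this with other orderings):

```python
def sort_demonstrations(supportive_forms: list[str]) -> list[str]:
    """KQG-CoT+: order the sampled logical forms from short (simple) to long (complex)."""
    return sorted(supportive_forms, key=len)
```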
As there is no existing attempt at few-shot KBQG with LLMs, we adopt five general few-shot prompting methods as our baselines. Standard Prompt (Brown et al., 2020) is a standard in-context learning method, where k random logical forms and questions are concatenated to form the prompt, and the prediction is a one-step generation. Random-CoT is an intuitive CoT prompting baseline where k logical forms are randomly selected from the data pool, and we follow the original work (Brown et al., 2020) to describe the sub-tasks in narrative form. Manual-CoT (Wei et al., 2022) is CoT prompting with k human-written exemplars as demonstrations, with the sub-tasks presented in narratives. Active-CoT (Diao et al., 2023) is an ensemble framework for CoT prompting: multiple logical forms are randomly selected as a validation set, and measurements such as disagreement and variance are leveraged as uncertainty values for each logical form to produce the final question. Auto-CoT (Zhang et al., 2023c) automatically constructs the prompt by selecting k demonstrations with a cluster-based algorithm, with the sub-tasks presented in narratives. We adapt this prompting method to KBQG by encoding all logical forms in a textual way.

Implementation Details
For encoding logical forms, we utilize the all-MiniLM-L6-v2 checkpoint from the Sentence-Transformers library on Huggingface. As this is a few-shot scenario, we manually write the rationales for the k demonstrations in the chain prompt. We utilize text-davinci-003 from the OpenAI API to generate questions and set the number of clusters to k = 12.
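A sketch of the generation call is shown below, assuming the legacy (pre-1.0) openai Python SDK that exposes the Completions endpoint for text-davinci-003; the decoding parameters and stop sequence are illustrative assumptions, not the paper's reported settings.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def generate_question(prompt: str) -> str:
    """Send the CoT prompt to text-davinci-003 and return the raw completion text."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,        # illustrative decoding parameters
        temperature=0.0,
        stop=["Input:"],       # assumed stop sequence: stop before a new worked example begins
    )
    return response["choices"][0]["text"]
```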

Main Results
Comparison with Baselines. Table 2 showcases the experimental results of our methods and the baseline approaches. We make the following observations: 1) Comparing all CoT prompting methods in the few-shot setting, our KQG-CoT+ prompting consistently outperforms the other methods across all KBQG datasets by a remarkable margin. Specifically, KQG-CoT+ improves over the competitive Auto-CoT by 0.72 to 2.12 absolute points across all datasets. Meanwhile, KQG-CoT also outperforms existing CoT prompting methods on BLEU-4 for all datasets. 2) All the CoT prompting methods outperform the standard prompting method, which indicates that, to generate questions with complex logic and long dependencies, splitting the entire generation task into sub-tasks is crucial for maintaining the coherence and accuracy of the questions.
3) Comparing Auto-CoT, KQG-CoT, and KQG-CoT+, even though all these methods adopt clustering to select k demonstrations, KQG-CoT and KQG-CoT+ are more effective because we elaborately design the encoding algorithm and prompt templates for KBQG, making them better suited to question generation from logical forms.
Comparison with Other Systems. We further compare our prompting methods with other KBQG systems on the WQ and PQ datasets. To the best of our knowledge, we are the first to work on KBQG using the GQ dataset, so there are no existing methods available for comparison.
In Table 3, we can see that with 12 demonstrations, our method outperforms the majority of fully-trained systems on the WQ dataset, where all training data is leveraged to train a model. The KQG-CoT+ prompting method achieves 29.73%, 31.08%, and 55.46% for BLEU-4, ROUGE-L, and METEOR respectively, which are close to the SoTA results.
In Table 4, we can see that on the PQ dataset, our method still achieves better results than most existing fully-trained KBQG models. Compared with existing methods under few-shot settings, our methods significantly improve BLEU-4 over AutoQGS by around 20 absolute points. It is worth noting that AutoQGS uses 0.1% of the training instances for training, whereas we simply leverage 12 instances for inference, which highlights the superiority of our methods.

More Analysis
Human Evaluation. We further conduct human evaluation by randomly sampling 300 examples from the test set of the WQ dataset. The generated questions are rated on a scale of 1 to 5 considering syntactic correctness, complexity, and relevance to the given logical forms. We ask three annotators to score the generated questions, with 1 point being poor and 5 points being perfect. The score of each question is averaged over all annotators. We present the results in Table 6, where we observe a trend similar to the automatic evaluation. Our approach outperforms all comparable methods, and its scores are close to those of the ground truth.

Ablation Study. We conduct an ablation study to assess the effectiveness of the components of our model and display the results in Table 7. We first exclude the CoT reasoning chain and observe a performance drop on the evaluated metrics, which indicates that CoT plays an important role in generating complicated questions. Then we remove the K-means algorithm and randomly select supportive logical forms. The decreased results indicate that our clustering algorithm provides more diverse logical forms as demonstrations. We further encode the entire logical forms without extracting their structures. The results decrease, which indicates that the structure is a significant indicator for obtaining the clusters.

Effect of k. We investigate the effect of k in Figure 3. As observed, with an increasing number of demonstrations, both our method and Random-CoT show increasing BLEU-4 and ROUGE-L scores. This indicates that the number of demonstrations is significant in activating the potential of LLMs. Compared with Random-CoT, our method shows a larger gain when the value of k becomes large, which indicates that our method indeed picks the most representative logical forms as demonstrations.

Case Study. To provide a comprehensive comparison between the KQG-CoT+ method and the baseline models on the GQ dataset, we present multiple example cases in Table 5. Our method elicits the intermediate generation steps and provides more guidance to LLMs, so that KQG-CoT+ generates questions that are grammatically correct and semantically close to the given logical form. In contrast, baseline methods may encounter issues such as inconsistency with the logical form, misplaced modifiers, or unsmooth expressions.

Conclusion
In this paper, we presented the KQG-CoT approach to tackle few-shot KBQG. KQG-CoT retrieves relevant logical forms from unlabeled data, taking their characteristics into account. It then constructs an explicit prompt that showcases the reasoning process for complex question generation based on the selected examples. Experimental results demonstrate that our approach achieves state-of-the-art performance compared to baselines and even shows competitive results to full-training methods.

A Appendix
A.1 Ablation Study on More Datasets
We display Table 9 to show more ablation studies on the WQ and PQ datasets. We can again recognize the significance of our CoT reasoning chain, K-means algorithm, and structure encoding.

A.2 Illustrative Examples of KQG-CoT+ Prompt
We present a selection of illustrative examples showcasing our proposed prompts and predictions on WQ, GQ, and PQ in Table 13, Table 14, and Table 15, respectively.

A.3 Detailed Prompt Design of KQG-CoT+
To enhance the guidance provided to the LLM in question generation, we include a descriptive sentence in the demonstrations, which states: "Let's engage in a step-by-step exercise of generating questions from logical forms. We have provided several examples, each comprising an 'Input' logical form and a corresponding 'Subquestion' that we aim to generate. By deconstructing the input logical form into basic components, we can generate questions iteratively until we get the final question. For each 'Subgraph', we can construct a relevant 'Subquestion' phrase to assist in generating the subsequent question in the sequence."

A.4 Effect of Demonstration Order
During the experiments, we made a noteworthy observation regarding the impact of demonstration order on the performance of our method. We conducted a comprehensive exploration of various sorting techniques, including uncertainty-based sorting (Diao et al., 2023), random sorting, and sorting based on the number of logical form hops. The detailed experimental results are presented in Table 8. It is evident that arranging the demonstrations in ascending order of the number of logical form hops leads to the most favorable outcomes. This finding highlights the importance of the structural complexity of logical forms when organizing the demonstrations.

A.5 KQG-CoT Improves KBQA Task
To confirm the efficacy of our approach in improving the performance of KBQA methods, we performed data augmentation on the WebQuestions dataset. It is worth noting that the augmented dataset was only half the size of the original dataset. We then trained a strong open-source KBQA method, RnG-KBQA, on the combination of the augmented and original datasets; this retrained version is denoted as RnG-KBQA+. The methodology comprised the following steps: First, we extracted the topic entity from each question in the original dataset. Next, we extracted relevant logical forms around the topic entity, aiming to include relations that were not present in the original dataset whenever possible. Finally, we combined the augmented data with the original data to train the model. The results are detailed in Table 10.
From the data in Table 10, it can be observed that although we performed only a very simple augmentation on a small portion of the data, the F1 score of the original KBQA method increased by 2.8%. This clearly demonstrates that our proposed KBQG method provides significant assistance to downstream KBQA tasks.

Table 10: The result of our approach in improving the performance of KBQA methods.

A.6 The Effectiveness of the Proposed Structured Encoding and Clustering
To demonstrate the effectiveness of the proposed Structured Encoding and Clustering in selecting diverse structures, we quantitatively assess the average semantic similarity between the logical forms selected by our method and by the baseline method at k = 8 on the GrailQA dataset.
The results are presented in Table 11.
The data in Table 11 reveal that the logical forms chosen by our method exhibit a lower average semantic similarity, i.e., greater diversity. Viewed collectively, these findings offer strong evidence for the efficacy of our proposed approach.
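The diversity measure behind Table 11 can be sketched as the average pairwise cosine similarity among the k selected logical forms, assuming the same Sentence-Transformers embeddings as in the selection stage; lower values indicate a more diverse selection.

```python
import numpy as np

def average_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of selected logical forms."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    pairs = sims[np.triu_indices(len(embeddings), k=1)]   # unique pairs, excluding the diagonal
    return float(pairs.mean())
```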

Figure 2 :
Figure 2: KQG-CoT framework. The supportive logical forms are selected from an unlabeled data pool by extracting the structures, clustering the structures, and sampling the most representative ones. A total of k demonstrations are automatically constructed using reasoning chains. The tested logical form is appended to the demonstrations to form the complete prompt, which can elicit the LLM to generate a series of subquestions sequentially from simple to complicated. Finally, the last subquestion can be extracted as the final prediction.

Figure 3 :
Figure 3: The BLEU-4 and ROUGE-L scores of our method and Random-CoT with increasing number of shots on GQ.

Table 1 :
Statistics of the evaluated datasets. #Q denotes the number of questions. #R and #E denote the total number of relations and entities, respectively. #T denotes the minimum/maximum/average number of triplets involved in each question.

Table 2 :
Few-shot evaluation of existing prompting methods with frozen LLMs on three KBQG datasets. The best and second-best results are boldfaced and underlined, respectively.

Table 4 :
Comparison between few-shot evaluation of KQG-CoT/KQG-CoT+ and few-shot/fully-trained evaluation of other systems on PQ.

Table 5 :
Illustrative examples from KQG-CoT+ and baseline methods on GQ.

Table 7 :
Ablation study of our KQG-CoT+ method on GQ.

Table 8 :
The results of using different sorting methods for demonstrations on the GQ dataset. Our KQG-CoT+ method sorts in ascending order of the number of logical form hops. Random sorting is done randomly. L2S sorting is performed in ascending order of length. Uncertainty sorting is based on descending order of uncertainty values. Similarity sorting is based on descending order of similarity between the logical forms of the demonstrations and the test instances.

Table 11 :
The average semantic similarity between the logical forms selected by different methods.

A.7 The Impact of the Sorted Order of Demonstrations
To assess the impact of the sorted order of demonstrations in KQG-CoT+, we compared the performance of Auto-CoT and Active-CoT using the same sorted order of demonstrations as in KQG-CoT+ (i.e., Auto-CoT+ and Active-CoT+) and conducted experiments on the GrailQA dataset. Table 12 shows that, compared to the Active-CoT+ and Auto-CoT+ methods, our proposed KQG-CoT+ method still exhibits significant improvements.