Generated Knowledge Prompting for Commonsense Reasoning

It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models. To investigate this question, we develop generated knowledge prompting, which consists of generating knowledge from a language model, then providing the knowledge as additional input when answering a question. Our method does not require task-specific supervision for knowledge integration, or access to a structured knowledge base, yet it improves performance of large-scale, state-of-the-art models on four commonsense reasoning tasks, achieving state-of-the-art results on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Generated knowledge prompting highlights large-scale language models as flexible sources of external knowledge for improving commonsense reasoning. Our code is available at https://github.com/liujch1998/GKP


Introduction
It remains an open research question whether external knowledge is needed for commonsense reasoning. On one hand, a substantial body of prior work has reported that integrating external knowledge can help improve task performance (Mitra et al., 2019; Bian et al., 2021, inter alia), especially if the knowledge is high quality (e.g. hand-crafted by experts). On the other hand, recent leaderboards are often dominated by large-scale pretrained models that are fine-tuned on a target benchmark (Khashabi et al., 2020; Lourie et al., 2021), suggesting that the benefits of external knowledge may wash away as the underlying models increase in size and are pretrained on ever larger amounts of raw text.
Even if external knowledge is found to be effective on a particular task, flexibility remains a fundamental hurdle to integrating external knowledge, as many benchmarks currently lack appropriate knowledge bases with sufficient coverage. Furthermore, prior methods often require task-specific, custom supervision for knowledge integration (Mitra et al., 2019; Chang et al., 2020), introducing a burden for rapidly adapting new pretrained models to a wide variety of tasks.
Figure 1: Generated knowledge prompting involves (i) using few-shot demonstrations to generate question-related knowledge statements from a language model; (ii) using a second language model to make predictions with each knowledge statement, then selecting the highest-confidence prediction.
In this paper, we investigate whether external knowledge can be helpful for commonsense reasoning, even on top of the largest state-of-the-art pretrained models (e.g. T5-11b (Raffel et al., 2019) and its variants), with a focus on four recent commonsense benchmarks. To facilitate easier adaptation with any zero-shot or finetuned models, we propose an approach that does not require access to a structured knowledge base or joint finetuning for knowledge integration.
The key insight behind our method, Generated Knowledge Prompting (sketched in Figure 1), is that we can generate useful knowledge from a language model, then provide the knowledge as an input prompt that is concatenated with a question. To support a variety of settings without finetuning, the quality and flexibility of knowledge is crucial. We propose a simple, yet effective, method that elicits knowledge statements (i.e. knowledge expressed as natural language statements) from generic language models in a few-shot setting. Compared to prior work that elicits knowledge via clarification questions (Shwartz et al., 2020) or contrastive explanations (Paranjape et al., 2021), our approach can generate knowledge flexibly, beyond the scope of pre-defined templates (Table 1).
Experiments show that our method improves both zero-shot and finetuned models on numerical commonsense (NumerSense (Lin et al., 2020)), general commonsense (CommonsenseQA (Talmor et al., 2019), CommonsenseQA 2.0 (Talmor et al., 2021)), and scientific commonsense (QASC) benchmarks, setting a new state-of-the-art on three of these datasets. It outperforms the template-based knowledge generation method self-talk (Shwartz et al., 2020), while performing comparably to retrieval-based systems.
Table 1: Examples where prompting with generated knowledge rectifies model prediction. Each section shows the correct answer in green, the incorrect answer in red, and the prediction scores from the inference model that only sees the question (top) and the same model that sees the question prompted with the given knowledge (bottom).
  Question: Part of golf is trying to get a higher point total than others. Prediction: yes (1.00 | 0.00)
  Knowledge: The player with the lowest score wins. Prediction: no (1.00)
  QASC
  Question: Sponges eat primarily. Prediction: cartilage (0.95 | 0.00)
  Knowledge: Sponges eat bacteria and other tiny organisms. Prediction: krill and plankton (0.99)
We find three factors contribute to the performance of generated knowledge prompting: (i) the quality of knowledge, (ii) the quantity of knowledge where the performance improves with more knowledge statements, and (iii) the strategy for integrating knowledge during inference. Our qualitative analysis suggests that the generated knowledge statements cover a variety of types, and can transform commonsense question answering to explicit reasoning procedures, e.g. deduction, that are supported by off-the-shelf and finetuned language models.

Generated Knowledge Prompting
A multiple-choice commonsense reasoning task involves predicting an answer $a \in A_q$ given a question $q \in Q$, where the set of choices $A_q$ is finite and can vary by question, and both questions and answers are variable-length text sequences. Our method answers commonsense questions in two steps.
The first step is knowledge generation, where we use a language model $p_G(k \mid q)$ to generate knowledge statements conditioned on the question:
$$K_q = \{k_m : k_m \sim p_G(k \mid q),\ m = 1, \dots, M\},$$
where each knowledge statement $k_m$ is a variable-length text sequence. Intuitively, each statement contains information that is helpful for answering the question (e.g. Table 1).
The second step is knowledge integration, where generated knowledge is integrated into the decision process of a language model used for inference,
$$\hat{a} = \arg\max_{a \in A_q} p_I(a \mid q, K_q).$$
In contrast, the vanilla setting of using the inference model without knowledge is represented by
$$\hat{a} = \arg\max_{a \in A_q} p_I(a \mid q).$$
Next, we describe the knowledge generation and integration steps in detail.

Knowledge Generation
We generate question-related knowledge statements by prompting a language model. The prompt consists of an instruction, a few demonstrations that are fixed for each task, and a new-question placeholder. The demonstrations are human-written, and each consists of a question in the style of the task and a knowledge statement that is helpful for answering this question. For a given task, we write five demonstrations using the format in Table 2.
We write questions (or select them from the training set, when available) that are representative of challenges posed by the task (e.g. numerical commonsense, scientific commonsense). We pair each question with a knowledge statement that turns the commonsense problem posed by the question into an explicit reasoning procedure, without directly answering the question. For example, the knowledge statement "Birds have two wings. Penguin is a kind of bird." is helpful for the question "Penguins have <mask> wings", because it turns the problem into deductive reasoning. Meanwhile, "Penguins have two wings." would be a poor knowledge statement to demonstrate according to our guideline.
When generating knowledge for a new question $q$, we plug the question into the placeholder, and repeatedly sample generated continuations of this prompt to obtain a set of knowledge statements $K_q = \{k_1, k_2, \dots, k_M\}$. For full prompts on all the tasks we evaluate on, see Appendix A.2.
Table 2: Prompts for knowledge generation for two of our tasks, NumerSense and QASC. The prompt consists of an instruction, five demonstrations of question-knowledge pairs, and a new question placeholder. For full prompts on all the tasks we evaluate on, see Appendix A.2.
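The prompt layout and repeated sampling described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the instruction string, and the `sample_fn` callable (a stand-in for any text-completion model) are assumptions, and the "Input:" / "Knowledge:" labels follow the CSQA2 prompt shown in the appendix.

```python
from typing import Callable, List, Tuple


def build_knowledge_prompt(
    instruction: str,
    demonstrations: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble instruction, fixed demonstrations, and the new question."""
    parts = [instruction, ""]
    for demo_question, demo_knowledge in demonstrations:
        parts.append(f"Input: {demo_question}")
        parts.append(f"Knowledge: {demo_knowledge}")
        parts.append("")
    # The new question fills the placeholder; the model continues after "Knowledge:".
    parts.append(f"Input: {question}")
    parts.append("Knowledge:")
    return "\n".join(parts)


def sample_knowledge(
    prompt: str,
    sample_fn: Callable[[str], str],
    num_samples: int,
) -> List[str]:
    """Draw num_samples continuations of the same prompt (hypothetical sampler)."""
    return [sample_fn(prompt).strip() for _ in range(num_samples)]


# Demonstration taken from the text; the instruction wording and the new
# question are hypothetical.
demos = [("Penguins have <mask> wings.",
          "Birds have two wings. Penguin is a kind of bird.")]
prompt = build_knowledge_prompt(
    "Generate some knowledge about the input.", demos, "Dogs have <mask> legs."
)
print(prompt)
```

Keeping the sampler behind a plain callable means the same prompt builder can be paired with any generator, which matches the method's aim of not tying knowledge generation to one model.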

Knowledge Integration via Prompting
In the knowledge integration step, we use a language model, called the inference model, to make predictions with each generated knowledge statement, then select the highest-confidence prediction. Specifically, we use each knowledge statement to prompt the model, forming M knowledge-augmented questions:
$$q_0 = q,\quad q_1 = [k_1 \| q],\quad \dots,\quad q_M = [k_M \| q],$$
where $[\cdot \| \cdot]$ denotes text concatenation.
We compute an aggregated score for each answer choice $a$ using the augmented question that best supports it under the inference model:
$$p_I(a \mid q, K_q) \propto \max_{0 \le m \le M} p_I(a \mid q_m). \tag{1}$$
Intuitively, this favors knowledge statements that strongly support one of the choices.
The predicted answer is then
$$\hat{a} = \arg\max_{a \in A_q} \max_{0 \le m \le M} p_I(a \mid q_m),$$
which is the choice that gets most support from one of the knowledge statements. This prediction uses a single knowledge statement, which we refer to as the selected knowledge:
$$\hat{k} = k_{\hat{m}}, \quad \text{where } \hat{m} = \arg\max_{0 \le m \le M} \max_{a \in A_q} p_I(a \mid q_m).$$
The inference model may be any existing language model taken off-the-shelf (i.e. zero-shot) or finetuned on the task. We do not do any further finetuning with knowledge prompting.
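The selection rule in this section can be illustrated with a short sketch. It assumes a hypothetical `score_fn(choice, prompt)` that returns the inference model's support for a choice, joins knowledge and question with a single space, and also scores the question on its own (index 0); the scores in the toy scorer are invented for illustration and are not results from the paper.

```python
from typing import Callable, List, Optional, Tuple


def integrate_knowledge(
    question: str,
    choices: List[str],
    knowledge: List[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, Optional[str], float]:
    """Return (predicted answer, selected knowledge or None, its score)."""
    # Prompt 0 is the bare question; prompt m prepends knowledge statement m.
    prompts = [question] + [f"{k} {question}" for k in knowledge]
    best_choice, best_index, best_score = choices[0], 0, float("-inf")
    for index, prompt in enumerate(prompts):
        for choice in choices:
            score = score_fn(choice, prompt)
            if score > best_score:
                best_choice, best_index, best_score = choice, index, score
    selected = knowledge[best_index - 1] if best_index > 0 else None
    return best_choice, selected, best_score


QUESTION = "Part of golf is trying to get a higher point total than others."
KNOWLEDGE = ["The player with the lowest score wins."]


def toy_score(choice: str, prompt: str) -> float:
    """Made-up scores standing in for a real inference model."""
    if prompt.startswith(KNOWLEDGE[0]):
        return {"yes": 0.1, "no": 0.9}[choice]
    return {"yes": 0.6, "no": 0.4}[choice]


answer, selected, score = integrate_knowledge(
    QUESTION, ["yes", "no"], KNOWLEDGE, toy_score
)
print(answer, "|", selected, "|", score)
```

Because the maximum is taken jointly over prompts and choices, a single confident knowledge statement is enough to change the prediction, and no model weights are updated.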

Experimental Setup
Here, we describe the implementation details of our method and how they are adapted to each task.
For knowledge generation, we use GPT-3 (Brown et al., 2020) as the underlying language model, where our few-shot prompting method is most effective. We generate $M = 20$ knowledge statements for each question with nucleus sampling $p = 0.5$ (Holtzman et al., 2019), and discard repetitions and empty strings. Generation is terminated when it exceeds 64 tokens or hits the \n token.
For inference, we use off-the-shelf T5 (Raffel et al., 2019) and GPT-3, as well as finetuned models that are state-of-the-art on each dataset, including UnifiedQA (UQA) (Khashabi et al., 2020) and Unicorn (Lourie et al., 2021). See details in the task setup below.
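The post-processing of sampled generations (stop at a newline, cap the length, discard repetitions and empty strings) might look like the sketch below. The function name is hypothetical, and whitespace-separated words are used as a stand-in for model tokens, which is an assumption; a real setup would count tokens with the generator's own tokenizer.

```python
from typing import List


def clean_generations(raw_generations: List[str], max_tokens: int = 64) -> List[str]:
    """Truncate each generation and drop empty or repeated statements."""
    cleaned: List[str] = []
    seen = set()
    for text in raw_generations:
        # Keep only the text before the first newline.
        first_line = text.split("\n", 1)[0]
        # Assumption: whitespace words approximate model tokens.
        statement = " ".join(first_line.split()[:max_tokens])
        if statement and statement not in seen:
            seen.add(statement)
            cleaned.append(statement)
    return cleaned


samples = ["Birds have two wings.\nInput: next", "Birds have two wings.", "", "  \nx"]
print(clean_generations(samples))
```

After this filtering a question can end up with fewer than M distinct statements, so downstream code should not assume a fixed count.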

Datasets and Task Setup
We evaluate our method on four commonsense reasoning datasets which cover a variety of challenges and problem formats.
NumerSense (Lin et al., 2020) consists of numerical statements about common objects and concepts where for each sentence we need to recover a masked number word. The choices are integers ranging from zero to ten, plus the word no, so the task can be framed as a multiple-choice problem. Since NumerSense is a diagnostic dataset, we only use zero-shot inference models, which is the current SOTA. We follow Zhang (2021) who uses the state-of-the-art zero-shot T5 with text-infilling setup and select the choice with highest likelihood on its token(s). We also implement zero-shot GPT-3 inference, where we plug in each choice to the question and compute the choice probability as the generative probability of the entire sentence, normalized over all the choices.
CommonsenseQA (CSQA) (Talmor et al., 2019) is a 5-way multiple-choice QA dataset about common world scenarios. We do inference with the zero-shot and finetuned T5 models. For zero-shot T5, we format the question as text-infilling, and predict the choice with highest sequence-to-sequence language modeling probability. For finetuned T5 (including UnifiedQA which is SOTA), we use the same setup as Khashabi et al. (2020).
CommonsenseQA 2.0 (CSQA2) (Talmor et al., 2021) is a binary classification dataset where we need to judge whether commonsense statements are true or false. We only do inference with the finetuned model, due to poor calibration of zero-shot models on this dataset. We use finetuned Unicorn (Lourie et al., 2021), which is the current SOTA, following the setup in Talmor et al. (2021).
QASC is an 8-way multiple-choice QA dataset about grade school science. This dataset also includes two pieces of background knowledge per question, whose composition fully answers the question. We do inference with zero-shot T5 and finetuned T5 (including UnifiedQA which is SOTA), using the same setups as CSQA.
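For the zero-shot GPT-3 setup on NumerSense, plugging each choice into the masked sentence and normalizing the sentence probabilities over all choices can be sketched as below. The helper names are hypothetical, and the sentence log-probabilities are assumed to come from some language model; the values in the example are invented.

```python
import math
from typing import Dict, List

# Choices as described in the text: zero to ten, plus the word "no".
NUMERSENSE_CHOICES: List[str] = [
    "no", "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
]


def fill_mask(question: str, choice: str, mask_token: str = "<mask>") -> str:
    """Plug a candidate number word into the masked sentence."""
    return question.replace(mask_token, choice)


def normalize_over_choices(sentence_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Turn per-choice sentence log-probabilities into a distribution."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(sentence_logprobs.values())
    exps = {c: math.exp(lp - max_lp) for c, lp in sentence_logprobs.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


candidates = [fill_mask("Penguins have <mask> wings.", c) for c in NUMERSENSE_CHOICES]
# Invented log-probabilities for three of the candidates, for illustration only.
probs = normalize_over_choices({"one": -12.0, "two": -10.0, "three": -11.0})
print(candidates[3], probs)
```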

Inference Model Setup
Since all the inference models we use (T5, UnifiedQA, Unicorn) are generative language models, the support to a choice by the inference model is
$$p_I(a \mid q) = \prod_{i=1}^{|a|} p(a_i \mid a_{<i}, q),$$
where $a_i$ is the $i$-th token of choice $a$.
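As a small worked example, and assuming the support for a choice is the product of the conditional probabilities of its tokens given the question and the preceding tokens, the score is usually accumulated in log space to avoid underflow on long choices. The function names and token probabilities below are hypothetical.

```python
import math
from typing import Sequence


def choice_log_score(token_logprobs: Sequence[float]) -> float:
    """Sum of per-token log-probabilities, i.e. the log of their product."""
    return sum(token_logprobs)


def choice_score(token_logprobs: Sequence[float]) -> float:
    """Probability of the whole choice under the assumed factorization."""
    return math.exp(choice_log_score(token_logprobs))


# A two-token choice whose tokens have made-up probabilities 0.5 and 0.4.
example = [math.log(0.5), math.log(0.4)]
print(choice_score(example))
```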

Knowledge Generation Baselines
We study the impact of our knowledge generation method (shorthanded as K) by comparing with the following baselines:
No knowledge (∅) We refer to inference without any knowledge statements as the vanilla baseline.
Random sentences (R) Sampling random sentences from the language model without conditioning on the question. We use the same implementation setup as our knowledge generation method (i.e. also using GPT-3, with the same hyperparameters).
Context sentences (C) Sampling sentences from the context of the question. This is implemented by sampling text continuations of the question from the language model. We use the same implementation setup as our knowledge generation method.
Template-generated knowledge (T) Self-talk (Shwartz et al., 2020) uses manually-designed templates to elicit knowledge statements from language models. For fair comparison, we use GPT-3 as the knowledge generator in self-talk, and bound the number of generations to $M = 20$ per question. Templates and other hyperparameters are kept the same as their original paper.
Retrieval-based knowledge (IR) Instead of being generated, knowledge can be retrieved from appropriate sources. We consider the following retrieval-based methods. For NumerSense, knowledge is retrieved from sentences in Wikipedia and GenericsKB. For CSQA2, we use snippets returned by Google when querying the question. For QASC, we use the associated fact sentences that are used to create each question.
Answers (A) Instead of generating knowledge, GPT-3 can be prompted to generate direct answers to questions. In the prompts, we use the same input questions as those in knowledge generation, while replacing the knowledge statement with the ground truth answer. We consider two baselines: (1) Generate one answer per question and use this to measure the performance of the few-shot GPT-3 inference model; (2) Generate $M = 20$ answers per question, and use these answers to prompt the SOTA inference models.
datasets we evaluate on, and works well under both zero-shot and finetuned settings. In particular, our knowledge generation outperforms naive baselines as well as template-based knowledge generation, and is on par with retrieval-based systems. Table 3 shows the results on zero-shot and finetuned models following our task setups.
New state-of-the-art. We apply our method on top of the same inference model used in the previous state-of-the-art. On NumerSense, we achieve a 6% (66.18 → 72.47) improvement over the previous best method based on the zero-shot T5 model.
Table 3 indicates that our method consistently improves upon the vanilla baseline set by finetuned inference models (though by smaller margins than in the zero-shot settings).
Table 3 reports the performance with different knowledge generation baselines. Generally, random sentences barely help and even hurt the inference model, whereas context sentences of the question provide some gain. In contrast, knowledge generated by our method consistently leads to substantial performance improvements, which implies that our knowledge is of high quality.

Knowledge Generation Methods
Knowledge is an essential factor. The few-shot GPT-3 model is poorly calibrated to directly answer commonsense questions, underperforming our best models by 14% to 20% across all tasks. Even when we use answers generated by few-shot GPT-3 to prompt the SOTA inference models, this still significantly falls behind our method on almost all the tasks and models we consider (with one exception: CSQA with T5 inference). Through the medium of knowledge, our method can effectively leverage useful information possessed by GPT-3 to help improve even the SOTA models on various commonsense reasoning tasks.
Our knowledge outperforms template-generated knowledge. We compare our knowledge generation method with the template-based self-talk on the CSQA dev set. (CSQA is the only task we experiment with that has self-talk templates available.) Our method leads to a larger improvement over the T5-11b baseline than self-talk (by 1.89%), showing that it is better at eliciting helpful knowledge from models.
Our knowledge is comparable with retrieval-based knowledge. On NumerSense, the retrieved knowledge only improves inference performance by 0.18% on test-core and 1.02% on test-all, while our method further outperforms it by 8.83% and 7.37%, respectively. This shows that knowledge retrieved from a loosely-related knowledge base can be far less useful than our generated knowledge. On CSQA2, although we are not able to beat the web-retrieved knowledge, our method still bridges the performance gap without referring to Google search. For QASC, the "retrieved" knowledge is actually gold knowledge from a knowledge base that was used to construct the dataset. As a result, our generated knowledge falls significantly short of the retrieved knowledge. In summary, our generated knowledge is roughly comparable with retrieved knowledge in terms of downstream performance, and is most valuable when there is no appropriate in-domain knowledge base to retrieve from.

Analysis
Better performance with more knowledge. We analyze the impact of the number of generated knowledge statements, $M$, and show the results in Figure 2. Generally, the performance increases with the quantity of knowledge statements. It saturates at $M = 20$ and begins to decline when more knowledge statements are introduced, which may be because more noisy knowledge is generated.
The knowledge integration method. In addition to the knowledge integration method described in §2.2, we experiment with two alternatives: Mixture-of-Experts (MoE) and Product-of-Experts (PoE) (Hinton, 2002). These make the following modifications to Equation 1, respectively:
$$p_I(a \mid q, K_q) \propto \sum_{0 \le m \le M} p_I(a \mid q_m) \quad \text{(MoE)},$$
$$p_I(a \mid q, K_q) \propto \prod_{0 \le m \le M} p_I(a \mid q_m) \quad \text{(PoE)},$$
where, as in §2.2, $q_m$ denotes the question prompted with the $m$-th knowledge statement and $q_0$ is the question alone. The results in Table 4 indicate that our knowledge integration method (i.e. adaptively choosing the best knowledge to rely on) is best among the three.
Lightweight inference models and amplification. We found that the size of the inference model affects the magnitude of improvement. Figure 3 shows the NumerSense performance gain on top of different sizes of inference model. As we use smaller inference models, the performance gain increases drastically. In particular, with our method the smallest T5 model is as powerful as the T5-3b baseline, and T5-large outperforms the GPT-3 baseline. This indicates that model-generated knowledge can enable high-performing, yet lightweight, inference models. Furthermore, the improvement does not diminish as the inference model becomes as big as the knowledge generation model, as the inference by GPT-3 can benefit by 9.0% from the knowledge elicited from itself. This indicates that our method can somewhat amplify the useful knowledge already possessed by the model, leading to better predictions.
The size of knowledge generation model. Figure 4 shows the NumerSense performance gain when using different sizes of GPT-3 as the knowledge generation model. On top of the T5-11b inference model, the 6.7B knowledge model gives a 5.0% improvement, smaller than the 10.5% improvement given by the 175B knowledge model. The 1.3B and 0.4B knowledge models do not give a significant improvement. Therefore, we do not necessarily need the largest version of GPT-3 as the knowledge source, though we do need the model to be relatively large in order to generate useful and reliable knowledge.
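The three integration strategies compared in this analysis (taking the best-supporting prompt, Mixture-of-Experts, and Product-of-Experts) differ only in how per-prompt scores are combined. The sketch below assumes uniform weights, with MoE as a plain sum and PoE as a plain product of per-prompt scores; the function names and the scores are invented for illustration.

```python
import math
from typing import Dict, List


def aggregate(scores_per_prompt: List[Dict[str, float]], strategy: str) -> Dict[str, float]:
    """Combine per-prompt choice scores with the named strategy."""
    if strategy == "max":      # best-supporting prompt
        combine = max
    elif strategy == "moe":    # unweighted sum over prompts
        combine = sum
    elif strategy == "poe":    # unweighted product over prompts
        combine = math.prod
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    choices = scores_per_prompt[0].keys()
    return {c: combine([scores[c] for scores in scores_per_prompt]) for c in choices}


def predict(scores_per_prompt: List[Dict[str, float]], strategy: str) -> str:
    """Pick the choice with the highest aggregated score."""
    aggregated = aggregate(scores_per_prompt, strategy)
    return max(aggregated, key=aggregated.get)


# Made-up scores: one dict per prompt (bare question, then two knowledge prompts).
toy_scores = [
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.10, "no": 0.90},
    {"yes": 0.85, "no": 0.15},
]
for name in ("max", "moe", "poe"):
    print(name, predict(toy_scores, name))
```

On these toy numbers the max rule follows the single most confident prompt, while the sum can be swayed by several moderately confident prompts, which is one way the strategies can disagree.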

Human Evaluation
We conduct a human evaluation on NumerSense and QASC to study the quality of generated knowledge and the interpretability of its impact on task performance.
Evaluation. We report the quality of knowledge statements along four axes: (1) Grammaticality: whether it is grammatical; (2) Relevance: whether it is relevant to the topic or concepts mentioned in the question; (3) Factuality: whether it is (mostly) factually correct; and (4) Helpfulness: whether it helps answer the question, either directly or indirectly, falling into one of three categories: helpful (i.e. supports the correct answer), harmful (i.e. negates the correct answer or supports an incorrect answer), or neutral (neither helpful nor harmful). These metrics are adapted from Shwartz et al. (2020) and are defined in Appendix A.3.
From each dataset, we sample up to 50 selected knowledge statements (§2.2) that change the correctness of T5-11b's prediction (i.e. rectify the model prediction from wrong to right, or mislead it from right to wrong). The statements are labeled by two NLP experts, and a moderate level of agreement was reached (Fleiss' kappa κ = 0.57 (Landis and Koch, 1977)). To ensure objectivity, it is not revealed to the annotators whether the knowledge rectifies or misleads the model prediction.
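For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed from a table of per-item category counts as sketched below. This is the standard textbook formula, not code from the paper, and the rating table is invented for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters assigning item i to category j."""
    num_items = len(counts)
    num_raters = sum(counts[0])
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    num_categories = len(counts[0])
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / num_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (num_items * num_raters)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)
    # Undefined (division by zero) when all ratings fall in one category.
    return (observed - expected) / (1 - expected)


# Invented example: two raters, two categories, four items, full agreement.
ratings = [[2, 0], [0, 2], [2, 0], [0, 2]]
print(fleiss_kappa(ratings))
```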
Results. Figure 5 summarizes the results. The vast majority of selected knowledge statements are grammatical and relevant to the question, and 83% of them are factually correct. 72% are seen as being helpful for answering the question according to the human evaluators, whereas 13% are harmful. Out of the knowledge statements that rectify the model predictions, 93% are labeled as helpful by the human evaluators; in contrast, when the knowledge statement misleads the model, only 21% are labeled as helpful, and 39% harmful. Of the knowledge statements deemed helpful by humans that rectify the model prediction, 95% are factual, while of those deemed harmful by humans that mislead the model prediction, 86% are non-factual, suggesting that improving knowledge factuality is a promising path towards more helpful knowledge. We also analyzed the non-selected knowledge and found that these statements have slightly lower factuality and helpfulness than the selected knowledge.
Table 5: More examples where prompting with generated knowledge reduces the reasoning type and rectifies the prediction. The first row of each section is the original question and the inference results associated with it; the second row is a model-generated knowledge statement that prompts the inference model. We show correct answers in green, incorrect answers in red, and their corresponding scores assigned by the inference model.

Qualitative Examples
rect answer, while with knowledge prompting, the correct answer is assigned a much higher score. Prompting with generated knowledge can transform commonsense reasoning into explicit reasoning procedures such as paraphrasing, induction, deduction, analogy, abductive reasoning, logical elimination, negation, and numerical reasoning.

Related Work
Knowledge can be elicited from pretrained language models. Numerous works have shown that pretrained language models implicitly contain a large amount of knowledge that can be queried via conditional generation (Davison et al., 2019; Petroni et al., 2019). Consequently, these models can directly perform inference on tasks like commonsense reasoning (Trinh and Le, 2018), text classification (Shin et al., 2020; Puri and Catanzaro, 2019), and natural language inference (Shin et al., 2020; Schick and Schütze, 2021). Inspired by these observations, we elicit question-related knowledge in an explicit form from language models and use it to guide the inference.
Leveraging external knowledge for commonsense reasoning. Some work uses external commonsense knowledge bases to make improvements on various NLP tasks, including commonsense reasoning. One approach is to inject commonsense knowledge into language models, either by pretraining on knowledge bases (Ma et al., 2021; Chang et al., 2020; Mitra et al., 2019; Zhong et al., 2019) or finetuning the model so that it can reason with additional retrieved knowledge (Chang et al., 2020; Mitra et al., 2019; Bian et al., 2021). Another direction is to ground the question into a knowledge graph and do inference with graph-based reasoning (Lin et al., 2019; Lv et al., 2020; Yasunaga et al., 2021).
A common prerequisite of these methods is a high-quality, high-coverage, in-domain commonsense knowledge base (Ma et al., 2019). Some commonsense reasoning datasets are derived from existing knowledge bases; for example, CommonsenseQA (Talmor et al., 2019) is derived from ConceptNet (Speer et al., 2017), and Social IQA (Sap et al., 2019b) is derived from ATOMIC (Sap et al., 2019a). For such datasets, it is natural to elicit related knowledge from the underlying knowledge base that derived them, and typically this would demonstrate considerable gains (Mitra et al., 2019; Chang et al., 2020). However, if there is a domain mismatch between the dataset and the knowledge base, such gains tend to diminish (Mitra et al., 2019; Ma et al., 2019). This becomes a bottleneck when encountering datasets that have no suitable knowledge base (e.g. NumerSense and CommonsenseQA 2.0 (Talmor et al., 2021)), or when the system needs to handle commonsense queries that do not fit in any of the commonsense domains represented by an existing knowledge base. Our work overcomes this difficulty by leveraging pretrained language models as the source of commonsense knowledge.
Adding generated text during inference. Recently, several works show that model performance on commonsense reasoning can be boosted by augmenting the question with model-generated text, such as clarifications, explanations, and implications. Self-talk (Shwartz et al., 2020) elicits clarifications to concepts in the question and appends them to the inference model input. Contrastive explanations (Paranjape et al., 2021) prompts inference models with generated explanations that contrast between two answer choices. The aforementioned methods depend on task-specific templates to query the generator, which means they are only capable of eliciting a limited variety of knowledge and require careful hand-crafting to transfer to new tasks. Other explanation-based methods (Latcinnik and Berant, 2020; Rajani et al., 2019) finetune the generator model so that it produces explanations that are used for question augmentation. DynaGen uses pretrained commonsense models to generate implications of a question and builds a dynamic graph of natural language statements on which reasoning is conducted. However, its usage of COMeT (Bosselut et al., 2019) as the generator confines its applicability to the social commonsense domain. Our work contributes to this general line of research. Unlike these previous methods, which elicit knowledge with task-specific templates or from finetuned knowledge generators, our method requires only a few human-written demonstrations in the style of the task, making it much more flexible, easy to transfer, and engineering-efficient.

Conclusion
We introduce generated knowledge prompting, a simple method to elicit and integrate knowledge from language models so as to improve performance on commonsense reasoning tasks. In particular, we generate knowledge statements by prompting a language model with task-specific, human-written, few-shot demonstrations of question-knowledge pairs. We show that knowledge can be integrated by simply plugging it in at inference time, with no need to finetune the model for knowledge integration. Our method shows effectiveness across multiple datasets, sets the new state-of-the-art on three commonsense reasoning tasks, and works under a variety of settings. The method's success highlights language models as sources of flexible, high-quality knowledge for commonsense reasoning.
Table 6 summarizes the comparison between our generated knowledge prompting method and prior methods that add generated text to an inference model for commonsense reasoning tasks. Our method is unique because it uses few-shot demonstrations to prompt for knowledge generation, and can apply to finetuned inference models without joint finetuning with knowledge.
Tables 7 through 10 show the full prompts for knowledge generation that we use for each evaluated task: NumerSense, CSQA, CSQA2, and QASC.
Tables 11 and 12 show the detailed guidelines we use for human evaluation of generated knowledge.

B.1 Limitations and Risks
Limitations. Our method is tested on a representative selection of commonsense reasoning tasks and datasets. Applying this method to other tasks may require people with moderate expertise to craft a task-specific prompt to feed into the method.
Risks. It is possible that our proposed method may lower the performance of commonsense reasoning systems if it is not implemented properly or is used with badly-designed prompts. Such risk can be mitigated by following the prompt design guidelines in this paper (§2.1).

B.2 Computation
We do not train any new model in this paper. Inference is conducted on Quadro RTX 8000 GPUs and costs about 200 GPU hours in total. Knowledge generation is done with the OpenAI GPT-3 API, with an approximate cost of $500.
Our method is implemented with PyTorch and the Huggingface Transformers library.

Method                                      Knowledge Generator       Inference Model
CAGE (Rajani et al., 2019)                  task-finetuned            joint-finetuned
Latcinnik and Berant (2020)                 task-finetuned            joint-finetuned
DynaGen                                     task-finetuned            joint-finetuned
Self-talk (Shwartz et al., 2020)            template-prompted         0-shot
Contrastive expl. (Paranjape et al., 2021)  template-prompted         0-shot & joint-finetuned
Generated knowledge prompting (ours)        demonstrations-prompted   0-shot & task-finetuned
Table 6: Comparison of methods that add generated text to an inference model. Knowledge Generator: task-finetuned: a model finetuned to generate task-specific knowledge; template-prompted: an off-the-shelf LM from which knowledge statements are elicited via templates; demonstrations-prompted: an off-the-shelf LM from which knowledge statements are elicited via few-shot demonstrations (§2.1). Inference Model: 0-shot: an off-the-shelf LM that is set up to make predictions; task-finetuned: a model finetuned with task training data (and without seeing extra knowledge); joint-finetuned: a model finetuned with task training data and generated knowledge.
Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog. Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.
Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. In many areas, such as memory, their cognitive powers match or exceed those of 'higher' vertebrates including non-human primates. Fish's long-term memories help them keep track of complex social relationships.
Input: A common effect of smoking lots of cigarettes in one's lifetime is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.
Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).

Input: {question}
Knowledge:
Table 9: Prompt for knowledge generation on CSQA2. Demonstration examples are selected from the CSQA2 training set; we use the annotated Google featured snippet as the knowledge.