Knowledge Rumination for Pre-trained Language Models

Previous studies have revealed that vanilla pre-trained language models (PLMs) lack the capacity to handle knowledge-intensive NLP tasks alone; thus, several works have attempted to integrate external knowledge into PLMs. However, despite the promising outcome, we empirically observe that PLMs may have already encoded rich knowledge in their pre-trained parameters but fail to fully utilize it when applied to knowledge-intensive tasks. In this paper, we propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from an external corpus. By simply adding a prompt like "As far as I know" to the PLM, we ask it to review related latent knowledge and inject the knowledge back into the model for consolidation. We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3. Experimental results on six commonsense reasoning tasks and the GLUE benchmark demonstrate the effectiveness of the proposed approach, proving that the knowledge stored in PLMs can be better exploited to enhance performance. Code is available at https://github.com/zjunlp/knowledge-rumination.


Introduction
Pre-trained language models (PLMs) have served the NLP community as fundamental infrastructure, demonstrating remarkable abilities with the "pre-train, prompt, and predict" paradigm (Liu et al., 2023b; Zhao et al., 2023). Vanilla PLMs, however, lack the capacity to handle knowledge-intensive tasks such as commonsense reasoning (Lin et al., 2019; Qiao et al., 2022; Liu et al., 2023a) and open-domain question answering (Yang et al., 2015). This has spurred a growing body of research on integrating external knowledge into PLMs.
However, despite this empirical success, we observe that PLMs often encode extensive knowledge within their parameters yet fail to utilize it effectively for knowledge-intensive tasks.
Taking a pilot experiment as an example, we apply knowledge probing (Petroni et al., 2019) to the PLM as shown in Figure 1. Given the question "If a bird is a carnivore, then it is likely a(n) what?", we notice that the PLM already holds the knowledge "a carnivore is likely a(n) predator" in its parameters; however, we surprisingly find that the finetuned PLM chooses the wrong answer despite knowing the related knowledge. Interestingly, this phenomenon mirrors human behavior. For example, in the cognitive reflection test (CRT) (Frederick, 2005), participants are posed a series of straightforward questions (already learned), yet they often fail initially in their intuitive reasoning. Upon reflection, however, individuals typically identify their erroneous responses and correct them. Consequently, we conjecture that today's prominent PLMs share such human-like flaws, and we still face the following question: are we fully exploiting the potential of the PLMs? Some pioneering researchers have attempted to unravel this enigma. For instance, Chen et al. (2022b) and van de Kar et al. (2022) propose to utilize the knowledge in the pre-training corpus with a retrieve-then-fine-tune method. Likewise, Bhagavatula et al. (2020) capitalizes on the implicit knowledge within large language models (>10B) by retrieving from model weights with recitation-augmented generation. These studies affirm that PLMs encapsulate a vast body of knowledge with untapped potential, while in this paper, we pursue a more universally applicable yet simple solution to fully harness the knowledge in PLMs for NLP.
To address this need, we introduce Knowledge Rumination to help the model think thoroughly when handling knowledge-intensive tasks. Analogous to how ruminant animals regurgitate food from the stomach back to the mouth for additional chewing, aiding digestion and absorption, we aim to mimic this process by having the model first review the relevant knowledge stored in its parameters and then consolidate this knowledge to better tackle the associated task.
In detail, we propose knowledge reviewing with a task-guided prompt, simply adding "As far as I know" to stimulate the model to recall latent knowledge. Subsequently, we consolidate the knowledge via the feed-forward network (FFN) to explicitly leverage latent knowledge for downstream tasks, since the FFN plays a crucial role in PLMs (Wang et al., 2022).
We apply the proposed knowledge rumination to various PLMs, including RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2021), and further transfer it to the large language model GPT-3 (175B) (Brown et al., 2020). Experimental results on six commonsense reasoning tasks and the GLUE benchmark demonstrate that this simple method obtains performance gains and even outperforms baselines that retrieve external knowledge. To conclude, we summarize the contributions of this work as follows:
• We propose a novel approach of Knowledge Rumination to better utilize the knowledge stored in the parameters, which is model-agnostic and can be applied to any PLM.
• Experimental results demonstrate that the proposed approach can successfully elicit related knowledge for both small and large PLMs, yielding better performance on six commonsense tasks and the GLUE benchmark.
• Comprehensive empirical analysis indicates that a large, underestimated amount of knowledge can still be retrieved from a PLM's model weights, and our work takes a small step in this direction.

Related Work and Background
Extracting Knowledge from PLMs Previous studies have shown that PLMs implicitly contain a large amount of knowledge. Petroni et al. (2019) show that such language models can be used for a knowledge base (KB) completion task by converting KB relations into natural language templates. Based on this finding, researchers attempt to treat the PLM as a knowledge base. Some studies (Bosselut et al., 2019; West et al., 2022; Hao et al., 2022) employ PLMs to construct knowledge graphs automatically. Meanwhile, others (Shwartz et al., 2020; Li et al., 2022) find that the knowledge possessed by PLMs can be used to enhance performance on downstream tasks. To date, several works (Wang et al., 2023; Zelikman et al., 2022; Bhagavatula et al., 2020) attempt to utilize PLMs to generate free-text rationales for reasoning. Our approach differs from previous works in that we aim to enhance the model's understanding of what it already knows in order to improve performance.
Knowledge-Enhanced Models Researchers resort to external sources to facilitate the model's ability to handle knowledge-intensive situations. One direction is to ground the question in a KB and conduct inference with both the question and the retrieved knowledge (Yasunaga et al., 2021, 2022; Zhang et al., 2022; Sun et al., 2019; Yao et al., 2022; Lv et al., 2020; Lin et al., 2019).
Since the pre-trained model can also be viewed as a knowledge store, several recent studies, including Self-talk (Shwartz et al., 2020), Rainier (Liu et al., 2022a), GKP (Liu et al., 2022b), and ElicitKnowledge (Li et al., 2022), propose to treat a large language model (e.g., GPT-3) as an external source from which to elicit knowledge for downstream tasks.
In contrast, our approach diverges from relying on external sources such as knowledge bases (KBs) or other language models (LMs). Instead, we concentrate on fully leveraging the latent knowledge acquired by the model itself. Some works also decompose the question into sub-questions and ask the model to answer each sub-question, such as the least-to-most prompt (Zhou et al., 2023). However, even these approaches encounter the issue we identify: the model may possess the answer to a sub-question within its parameters but still fail to provide the correct response.
The underlying intuition behind our method is that current methods for harnessing the power of pre-trained language models (PLMs) have not fully tapped into the knowledge residing within the model's parameters. Note that our approach most closely aligns with Self-talk (Shwartz et al., 2020), but with the additional capability to manage parametric knowledge (such as embeddings in feed-forward networks), which broadens the scope of the idea to a certain extent.

Knowledge Rumination
In this section, we introduce the technical details of knowledge rumination to tap into the potential of PLMs (§3.1). Given a PLM G, we first freeze the model parameters and design a task-specific prompt (§3.2) to guide the model in reviewing its stored knowledge regarding the task and input (knowledge reviewing). We then consolidate the model's latent knowledge (§3.3) while tuning on downstream tasks (knowledge consolidation).

Model Architecture
We take a representative task, multiple-choice commonsense reasoning, as an example to elucidate the details of knowledge rumination; the method can be readily adapted to other NLP tasks. Given a question q, multiple-choice commonsense reasoning aims to select the correct answer a_k ∈ A, optionally provided with a context c. The set of possible answers A is finite and varies for each question. In the vanilla setting, the PLM directly answers the question by selecting the answer choice â with the highest score, based on the concatenation of the question q, the context c, and each possible answer choice a_i:

$$\hat{a} = \operatorname*{argmax}_{a_i \in \mathcal{A}} P(a_i \mid c, q)$$

Here, before making a prediction, we ask the model to carefully consider the question and review its prior knowledge. We freeze the PLM G_θ to probe the knowledge it has stored (θ denotes the model's parameters) and prepend trainable continuous tokens to each layer. For each question q, we create a unique prompt p_q to guide the model in reflecting on its knowledge:

$$r = G_\theta([p_q; q])$$

Then, the PLM reinforces its knowledge r of the problem and infers the answer augmented with r. Ideally, the model should generate helpful knowledge text for the question. However, training such a model would require expensive knowledge annotations for all training instances. To handle this problem, we use the model's contextualized representation output as the knowledge for rumination and treat it as a latent variable. The model answers the question based on both the question q and the vectorized knowledge r:

$$\hat{a} = \operatorname*{argmax}_{a_i \in \mathcal{A}} P(a_i \mid c, q, r)$$

The cross-entropy loss is then used to train the whole model. Assuming the answer a_k is correct, the loss is:

$$\mathcal{L} = -\sum_{a_i \in \mathcal{A}} Q(a_i \mid c, q) \log P(a_i \mid c, q, r)$$

where Q(a_i | c, q) is 1 if a_i = a_k and 0 otherwise.
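To make the objective concrete, the following is a minimal PyTorch sketch of this multiple-choice loss; the function name and tensor shapes are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn.functional as F

def rumination_loss(choice_scores: torch.Tensor, answer_idx: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over answer choices.

    choice_scores: (batch, |A|) -- one score per concatenation [c; q; a_i],
        produced by the PLM augmented with the latent knowledge r.
    answer_idx:    (batch,) -- index of the gold answer a_k; this encodes
        the one-hot distribution Q(a_i | c, q) from the loss above.
    """
    log_probs = F.log_softmax(choice_scores, dim=-1)   # log P(a_i | c, q, r)
    return F.nll_loss(log_probs, answer_idx)           # -log P(a_k | c, q, r)
```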
During training, the gradient flows back into the model, which helps it learn to review and consolidate useful information.

Knowledge Reviewing with Task-guided Prompting
Just as animals return partially digested food from the stomach to the mouth for re-chewing, we design specific prompts for each question to probe the latent knowledge for rumination. As shown in Figure 2, we begin with the background prompt: "As far as I know, [MASK]". Note that humans consider the mentions in a description to better understand the question. For example, when answering the question "If a bird is a carnivore, then it is likely a(n) what?", humans would consider the mentions bird and carnivore to better comprehend the question. We therefore introduce the mention prompt to review knowledge of mentions. Specifically, we extract mentions M from the questions using off-the-shelf tools and build prompts to elicit memories about these mentions. However, some mentions contain unrelated information and may divert attention. To address this, we propose mention relevance scoring, where we utilize an encoder model G_enc to evaluate the relevance of each mention to the question context. Specifically, we compute the relevance score for each mention m ∈ M by concatenating its text with the question q and using the output at "[CLS]" as the score (f_cls in the following equation):

$$\rho_m = f_{\text{cls}}(G_{\text{enc}}([m; q]))$$

We take the mentions with the top-2 relevance scores ρ as the target mentions. The details of mention relevance scoring can be found in Appendix A.5. Apart from such commonsense or contextual knowledge, PLMs also store other types of knowledge, including skill knowledge and task knowledge (Wang et al., 2022). Hence, we assume that the model may hold latent knowledge about the task itself, and we further build task prompts to encourage the model to reflect on the skill knowledge encoded within its parameters. Overall, we probe the PLM using three types of task-guided prompts, targeting three areas of interest:
• Background Prompt: designed to help the model think about the background; the prompt is As far as I know [MASK].
• Mention Prompt: used to elicit memories of mentions; the form is About <Mention>, I know [MASK].
• Task Prompt: designed to help the model recall memories of the task. For example, for sentiment analysis, the prompt is About sentiment analysis, I know [MASK].
We place several '[MASK]' tokens in each task-guided prompt; the number of '[MASK]' tokens is a hyperparameter that varies across tasks (a construction sketch follows below).
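As a concrete illustration, the three templates above can be instantiated as follows; the helper function and the default number of mask tokens are illustrative assumptions, while the template strings follow the examples above:

```python
def build_prompts(mentions, task=None, mask_token="[MASK]", n_masks=5):
    """Instantiate the three task-guided prompt templates.

    mentions: the top-2 mentions selected by relevance scoring.
    n_masks:  the number of mask tokens, a per-task hyperparameter.
    """
    masks = " ".join([mask_token] * n_masks)
    prompts = [f"As far as I know, {masks}."]           # background prompt
    for m in mentions:                                  # mention prompts
        prompts.append(f"About {m}, I know {masks}.")
    if task is not None:                                # task prompt
        prompts.append(f"About {task}, I know {masks}.")
    return prompts

# e.g. build_prompts(["bird", "carnivore"]) for the running example
```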
Our approach can be applied to different PLMs. For encoder-style models like RoBERTa, we utilize the hidden states h_[MASK] of the '[MASK]' tokens as the latent knowledge for rumination; f_mask in the following equation denotes taking h_[MASK] from the model's output:

$$r = f_{\text{mask}}(G_\theta([p_q; q]))$$
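With the Huggingface transformers library, f_mask can be sketched as below. This is a minimal sketch assuming RoBERTa-large as the backbone and omitting the trainable prefix vectors, so it only illustrates the extraction step:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large").eval()

def f_mask(prompt: str) -> torch.Tensor:
    """Return the hidden states at the mask positions as latent knowledge r."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():                              # the backbone stays frozen
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, d)
    is_mask = inputs["input_ids"] == tokenizer.mask_token_id
    return hidden[is_mask]                             # (n_masks, d)

masks = " ".join([tokenizer.mask_token] * 3)
r = f_mask(f"As far as I know, {masks}.")              # r: (3, 1024)
```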
The following section will explain how to utilize this elicited knowledge.

Knowledge Consolidation with FFN Neuron Augmentation
To reinforce its understanding, the model should re-digest (inject) the elicited knowledge r of the question q, similar to how animals chew their food again. However, where to inject it into the PLM remains a challenging issue, pointing to ongoing work on how the model's knowledge is stored. Previous studies (Dai et al., 2022; Wang et al., 2022) have discovered that the feed-forward network (FFN) works as knowledge neurons or skill neurons, suggesting that FFNs may store factual information and encode task-specific skills. Inspired by these findings, we incorporate r into the FFNs, as in previous work (Yao et al., 2022). Here, we select the top-1 layer to re-digest (inject) the knowledge. Viewing the two linear layers in the FFN as a key-value network with keys K and values V, we employ two distinct linear layers to project the information r into the vector space of the matching layer:

$$K_r = W_k r, \qquad V_r = W_v r$$

where W_k and W_v are the weights of the two linear layers (W_k, W_v ∈ R^{d×d}, with d the intermediate size of the PLM). The two matrices are randomly initialized and updated during training. We expand the FFN by concatenating the projected knowledge to the end of each linear layer, obtaining the expanded K_E and V_E. The computation can be described as:

$$K_E = [K; K_r], \qquad V_E = [V; V_r], \qquad \text{FFN}(H) = f(H \cdot K_E^{\top}) \cdot V_E$$

where H denotes the output hidden states of the self-attention module and f is the activation function. The model then answers the question with the help of the regurgitated knowledge.
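The consolidation step can be sketched in PyTorch as below: r is projected by W_k and W_v into extra key/value slots appended to the selected FFN layer. The class name, tensor shapes, and the GELU activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RuminationFFN(nn.Module):
    """An FFN viewed as a key-value memory, expanded with projected knowledge."""

    def __init__(self, hidden: int, intermediate: int, d_r: int):
        super().__init__()
        self.K = nn.Linear(hidden, intermediate)   # keys: first FFN layer
        self.V = nn.Linear(intermediate, hidden)   # values: second FFN layer
        self.act = nn.GELU()
        # W_k / W_v project r into the key/value spaces; randomly
        # initialized and updated with the downstream task
        self.W_k = nn.Linear(d_r, hidden)
        self.W_v = nn.Linear(d_r, hidden)

    def forward(self, H: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # H: (B, L, hidden) self-attention output; r: (B, n, d_r)
        K_r, V_r = self.W_k(r), self.W_v(r)                 # (B, n, hidden)
        base = self.K(H)                                    # (B, L, intermediate)
        extra = torch.einsum("bld,bnd->bln", H, K_r)        # (B, L, n) extra keys
        inner = self.act(torch.cat([base, extra], dim=-1))  # scores over K_E
        m = base.size(-1)
        return self.V(inner[..., :m]) + torch.einsum(
            "bln,bnd->bld", inner[..., m:], V_r             # readout over V_E
        )
```

Because the appended slots take part in the same key-value computation as the original FFN memories, the projected knowledge is consulted at every token position.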
More details about the baseline models can be found in Appendix A.3.2. For the tasks in the GLUE benchmark, in addition to RoBERTa_large, we compare our model with a prompt-learning method, LM-BFF (Gao et al., 2021). LM-BFF proposes a prompt-based finetuning method together with a refined strategy for dynamically and selectively incorporating demonstrations into each context. In our paper, we simply use the human-curated prompts (provided by LM-BFF) and leverage RoBERTa_large as the backbone. Following LM-BFF (Gao et al., 2021), we report results on the validation set.

Experiment Implementation
In the stage of knowledge reviewing with task-guided prompting, the backbone of the model is frozen, and we only update the prepended trainable continuous tokens (prefix prompt). When implementing DeBERTa, due to its more complex attention mechanism, we simply freeze the whole model and do not add the prefix tokens. For the commonsense reasoning tasks, we combine the mention prompt and the background prompt; for the GLUE benchmark, we find the task prompt to be more useful. More details can be found in Appendices A.2 and A.3.

Main Results
We list the results for the commonsense reasoning tasks in Table 1 and observe that: 1) the knowledge stored in the parameters is robust but requires explicit activation during finetuning; and 2) the performance of models that retrieve knowledge from external sources is limited by the quality and relevance of the knowledge sources, whereas our rumination method can produce more pertinent knowledge.
As shown in Table 2, the results on the GLUE benchmark demonstrate that knowledge rumination outperforms both the basic finetuned RoBERTa and the prompt-based method LM-BFF, with average improvements of +1% over LM-BFF and +3% over RoBERTa. These gains highlight the effectiveness of knowledge rumination compared to finetuning and prompt learning.

Out-of-Distribution (OOD) Performance
To better illustrate the broad applicability and generalization ability of knowledge rumination, we extend our evaluation to performance on out-of-distribution (OOD) test sets.

Impact of Different Knowledge Consolidation Methods
Apart from injecting knowledge into the FFN, we compare and evaluate another injection method: concatenation. Since the knowledge r is a vector, we concatenate it into the sequence after the embedding layer and report the results in Table 4. We notice that both methods benefit from knowledge rumination. Typically, integrating the knowledge through feed-forward networks (FFNs) performs better than concatenation, with average improvements of +0.5% on SocialIQA and +2.1% on OBQA. This supports previous findings that FFNs store factual knowledge (Dai et al., 2022; Yao et al., 2022) and that our method can effectively consolidate this knowledge within FFNs. A sketch of the concatenation variant follows below.
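For reference, the concatenation variant can be sketched as below, assuming the knowledge has already been projected to the embedding dimension (the function name and shapes are illustrative; attention-mask bookkeeping is omitted):

```python
import torch

def concat_inject(token_embeds: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Append the ruminated knowledge to the embedded input sequence.

    token_embeds: (batch, seq_len, hidden), output of the embedding layer.
    r:            (batch, n, hidden) latent knowledge vectors.
    """
    # the extended sequence is then fed through the Transformer as usual
    return torch.cat([token_embeds, r], dim=1)   # (batch, seq_len + n, hidden)
```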

What does the Model Ruminate?
Despite the advantages of knowledge rumination, it is essential to understand the nature of the generated knowledge and the mechanism behind the method. Our model produces a contextualized embedding r as latent knowledge. To make this knowledge interpretable to humans, we convert the vectorized knowledge into symbolic text. To evaluate the effectiveness of the method, we sample successful cases where the simple finetuned model makes an incorrect prediction while our knowledge rumination method provides the correct answer. Figure 3 illustrates an example from the CommonsenseQA task. To generate output words in the vocabulary space, we apply the masked language modeling (MLM) head over the positions of the '[MASK]' tokens (sketched below). We notice that the masked words often carry information similar to the answer. For example, for the question "Who is likely to have a caring heart?", the '[MASK]' tokens contain words such as 'people,' 'heart,' and 'like.' In addition, we map the knowledge rumination output r back into the pre-training corpus space, as the memory is constructed during pre-training. We conduct a dense embedding similarity search to identify what our generated contextualized representation is most similar to. Here, we represent the ruminated knowledge by averaging the '[MASK]' embeddings. For each sample from external sources, we add a '[MASK]' token at the end of the sentence and use it to represent the sentence. We use the pre-training corpus Wikipedia (Milne and Witten, 2008) as the retrieval source and employ FAISS (Johnson et al., 2021) for dense vector search. Interestingly, we notice that the model recalls its memory of a person with a caring heart: "a good person. He helps victims and people." This suggests that the model memorized this information during pre-training, and when given a chance to think, the model is aware of what it has learned. We also map the knowledge to the external knowledge source ConceptNet (Speer et al., 2017), since CommonsenseQA is derived from it. More details can be found in Appendix A.6.
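The vocabulary-space readout can be sketched as follows, assuming a RoBERTa MLM head; the helper is illustrative and its output will not necessarily reproduce Figure 3:

```python
import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
mlm = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def mask_top_words(prompt: str, top_k: int = 5):
    """Project each mask position into vocabulary space via the MLM head."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits                      # (1, seq_len, vocab)
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    top_ids = logits[0, is_mask].topk(top_k, dim=-1).indices
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top_ids]

question = "Who is likely to have a caring heart?"
print(mask_top_words(f"{question} As far as I know, {tokenizer.mask_token}."))
```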

Transfer to LLMs
In this part, we endeavor to transfer knowledge rumination to LLMs. Compared to small language models, LLMs demonstrate an excellent ability to recall and handle knowledge with few-shot demonstrations as a prompt. GKP (Liu et al., 2022b) notes that prepending knowledge retrieved from LLMs can help both the LLMs themselves and other models handle tasks effectively. As such, we follow suit (Liu et al., 2022b,a) in our knowledge reviewing process, using prompts to recall the memorized knowledge for the target inputs. Nevertheless, simply concatenating the recalled knowledge might not lead to effective utilization by the model. Hence, in the knowledge consolidation phase, we explicitly ask the model to attend to the recalled knowledge by crafting a stimulus prompt such as "According to the [knowledge], the answer is" or "think by the [knowledge]". Furthermore, recent work suggests that Chain-of-Thought (COT) (Wei et al., 2022) can elicit LLMs' reasoning ability by employing a series of intermediate reasoning rationales. In contrast, the knowledge generated by knowledge rumination contains implicit information acquired during pre-training, which may otherwise be overlooked by the reasoning steps. We report the results of GPT-3 Davinci (175B) with knowledge rumination in Figure 4, compared with the original few-shot GPT-3, GKP, and COT. The implementation details can be found in Appendix A.3.2 and the demonstrations in Appendix A.2. A sketch of the two-stage prompting appears below.
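The two-stage procedure can be sketched as follows; `generate` is a placeholder for any few-shot completion call (e.g., to GPT-3), and the prompt strings are abbreviated stand-ins for the full demonstrations in Appendix A.2:

```python
def ruminate_and_answer(question, choices, demos, generate):
    """Two-stage rumination for an LLM: recall knowledge, then answer with it.

    `generate` stands in for a text-completion call; `demos` holds the
    few-shot demonstrations for knowledge reviewing.
    """
    # Stage 1 (knowledge reviewing): prompt the model to recall what it knows.
    knowledge = generate(f"{demos}\nQuestion: {question}\nAs far as I know,")
    # Stage 2 (knowledge consolidation): a stimulus prompt asks the model to
    # attend to the recalled knowledge when answering.
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return generate(
        f"Question: {question}\nOptions: {options}\n"
        f"Knowledge: {knowledge}\n"
        f"According to the knowledge, the answer is"
    )
```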
Our findings indicate that the performance of GPT-3 can be significantly enhanced through knowledge rumination, as evidenced by accuracy improvements of 16% on OBQA, 12% on CSQA, and 11% on SocialIQA. Compared to GKP, it is evident that merely concatenating the elicited knowledge does not adequately leverage it; the knowledge rumination approach surpasses GKP by an average of 6%, demonstrating its efficacy. Moreover, knowledge rumination attains better performance than COT on OBQA and CSQA, though not on SocialIQA, demonstrating the effectiveness of the background knowledge. In this preliminary exploration, we find that guiding LLMs to deliberate thoroughly on recalled knowledge can augment their understanding and reasoning capabilities. Looking forward, better utilization and integration of the model's inherent knowledge for reasoning remains a promising direction for future investigation.
Error Analysis We conduct an error analysis on the evaluation examples from the OBQA and CSQA datasets for the GPT-3 model. We categorize the errors into four types: 1) Failure to Utilize: the model recalls helpful information but does not provide the correct answer. 2) Ineffective Rumination: the ruminated information with the highest log-probability is irrelevant to the question, although some of the remaining candidates are relevant. 3) Incorrect Memory: the model's stored information about the question is incorrect. 4) Missing Information: the model lacks the necessary information about the problem. Examples of each error type can be seen in Appendix A.8. The statistics are presented in Table 5. We observe that the majority of errors are caused by missing information, indicating that large pre-trained language models still have difficulty retaining all the knowledge encountered during pre-training. Additionally, our method still has difficulty activating all the stored knowledge, as 32% of error cases on the CSQA task are caused by ineffective rumination and 18% by failure to utilize the recalled information. This suggests that there is still room for improvement in this area.

Conclusion and Future Work
In this work, we propose knowledge rumination for PLMs, which serves as a general solution to exploit latent knowledge for downstream tasks and demonstrates promising results. The concept is akin to how humans often err when answering without thinking thoroughly. In the future, we plan to apply knowledge rumination to more NLP tasks and more types of models.

A Appendix
A.1 Detailed Comparison with Previous Approaches Specifically, we note that Self-talk (Shwartz et al., 2020), Rainier (Liu et al., 2022a), GKP (Liu et al., 2022b), and ElicitKnowledge (Li et al., 2022) all harness knowledge, in the form of text sequences, extracted from pre-trained language models to enhance performance on knowledge-intensive tasks.
Similarly, the concept of Knowledge Rumination draws from the same inspiration, enabling pre-trained language models to leverage related latent knowledge without the need for retrieval from an external corpus. Among these prior studies, our method bears the closest resemblance to Self-talk (Shwartz et al., 2020), with the added capability of handling parametric knowledge (e.g., embeddings in feed-forward networks). This extends the scope of the concept to a certain degree.
Our work also shares a connection with COT (Wei et al., 2022). However, while COT generates rationales (texts) and appends them to output sequences to assist reasoning, our knowledge rumination model generates implicit knowledge and integrates it with the input sequence to produce the desired results. Additionally, COT is primarily focused on reasoning, so its rationales serve as intermediate steps in the reasoning process. By contrast, the knowledge generated by our knowledge rumination model constitutes implicit information acquired by pre-trained language models during the pre-training phase.

A.2 Prompts for Knowledge Rumination with LLM
Tables 7 through 9 show the full prompts for knowledge rumination that we use for each evaluated task (demonstrations are derived from Liu et al. (2022b,a)): CSQA, OBQA, and SocialIQA.

A.3 Experimental Settings
In this section, we describe the implementation of our experiments in detail, including the baseline methods, backbone models, and hyperparameters. Our model is built on the Huggingface framework (Wolf et al., 2020). Unlike finetuning, which updates all model parameters θ of a PLM, prefix-tuning freezes all pre-trained Transformer parameters and only optimizes prefix vectors that are prepended to each Transformer layer. We use prefix-tuning (Li and Liang, 2021) to train the knowledge reviewing model for each task because: 1) the rumination models for different tasks can share the same backbone Transformer parameters, with only the prefix vectors differing; and 2) prefix-tuning has performance comparable to finetuning while avoiding the risk of catastrophic forgetting.
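A minimal sketch of the prefix parameters is given below; the class name and shapes follow common past-key-values-style prefix-tuning implementations and are assumptions rather than our exact code:

```python
import torch
import torch.nn as nn

class PrefixVectors(nn.Module):
    """Trainable key/value prefixes prepended to every Transformer layer.

    The backbone PLM stays frozen; only these parameters receive gradients
    during knowledge reviewing.
    """

    def __init__(self, n_layers: int, n_heads: int, head_dim: int, prefix_len: int = 10):
        super().__init__()
        # one key prefix and one value prefix per layer
        self.prefix = nn.Parameter(
            0.02 * torch.randn(n_layers, 2, prefix_len, n_heads, head_dim)
        )

    def forward(self, batch_size: int):
        # expand to the batch and reshape each layer's prefix to
        # (batch, n_heads, prefix_len, head_dim), the usual cache layout
        expanded = self.prefix.unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)
        return tuple(
            (layer[0].permute(0, 2, 1, 3), layer[1].permute(0, 2, 1, 3))
            for layer in expanded
        )
```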
For the tasks in the GLUE benchmark, most hyperparameters are the defaults of LM-BFF. For the commonsense reasoning tasks, we follow the preprocessing of QA-GNN (Yasunaga et al., 2021). We choose RoBERTa_large (Liu et al., 2019) and DeBERTa_large (He et al., 2021) as our backbone models; the average training time for each model is 2 to 4 hours. We apply grid search for hyperparameter tuning.

Dragon (Yasunaga et al., 2022) is a deeply joint language-knowledge foundation model pre-trained from text and KG at scale, which achieves strong performance on reasoning about language and knowledge.
GKP (Liu et al., 2022b) prepends the knowledge before the question. The original paper concatenates knowledge before each candidate and computes a probability for each candidate. In our setting, to compare with COT, we provide all the candidates and ask the LLM to produce the final answer.
COT (Wei et al., 2022): we use the chains of thought provided by previous work (Fu et al., 2023) and utilize the same number of demonstrations for GKP and knowledge rumination.

A.4 Evaluation Metrics
For the commonsense reasoning tasks, we use accuracy as the evaluation metric. For the GLUE benchmark, we use the same metrics as the original paper.

A.5 Mention Relevance Score
To score the relevance of each mention conditioned on the question context (§3.2), we use the sentence embedding model all-roberta-large-v1 from SentenceBERT (Reimers and Gurevych, 2020) to calculate cosine similarity. To identify what our generated contextualized representation is similar to, we use the pre-training corpus Wikipedia (Milne and Witten, 2008) and the external knowledge source ConceptNet (Speer et al., 2017) as retrieval sources (Table 6). For efficient similarity search, we use the 1024-dimensional hidden representations to create a FAISS (Johnson et al., 2021) index and search for the top-20 similar triples/samples. The FAISS index type is IndexPQ, which is based on a product quantizer: stored vectors are approximated by PQ codes. A sketch of the index construction is given below.
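The index construction and search can be sketched with the FAISS Python API as below; the product-quantizer settings and the random stand-in vectors are illustrative:

```python
import faiss
import numpy as np

d = 1024                 # dimensionality of the '[MASK]' representations
M, nbits = 64, 8         # product-quantizer settings (illustrative)

index = faiss.IndexPQ(d, M, nbits)   # stored vectors approximated by PQ codes

corpus = np.random.rand(100_000, d).astype("float32")  # stand-in corpus embeddings
index.train(corpus)      # PQ requires a training pass before adding vectors
index.add(corpus)

query = np.random.rand(1, d).astype("float32")         # ruminated knowledge r
distances, ids = index.search(query, 20)               # top-20 similar items
```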
For ConceptNet, we follow KagNet (Lin et al., 2019), which uses sentence templates to verbalize triples into strings such as "diamond can be in jewelry store". Additionally, we add a '[MASK]' token at the end of each triple string and feed it as text input. For Wikipedia, we retrieve 10,000 samples for each sentence based on the prebuilt index in Pyserini (Lin et al., 2021) and then use the original text as input.

A.7 Downstream Evaluation Datasets
We use the following six commonsense reasoning benchmarks for the experiments in the general domain (§4). CommonsenseQA (CSQA) (Talmor et al., 2019) is a 5-way multiple-choice QA task testing commonsense reasoning. The dataset has 12,102 questions. We use the in-house data splits of Lin et al. (2019).
OpenBookQA (OBQA) (Mihaylov et al., 2018) is a 4-way multiple-choice QA task containing elementary science questions. It has 5,957 questions. We use the original data splits of Mihaylov and Frank (2018).
Social Interaction QA (SocialIQA) (Sap et al., 2019) is a 3-way multiple-choice QA task testing social commonsense reasoning. It has 37K questions. We use the original data splits of Sap et al. (2019).
Physical Interaction QA (PIQA) (Bisk et al., 2020) is a 2-way multiple-choice QA task testing physical reasoning about objects. It has 20K questions. We split the dev set in half to make in-house dev/test sets.
HellaSwag (Zellers et al., 2019) is a 4-way multiple-choice task testing grounded commonsense reasoning about events. It has 70K questions. We split the dev set in half to make in-house dev/test sets.

A.8 Examples for Different Error Case
In this section, we show one example for each error type. For each example, we list the knowledge in descending order of probability score. Our experiment takes only the highest-scoring knowledge (shown in bold) as the rumination information.
A.8.1 Failure to Utilize
Question: What do people typically do while playing guitar? Answer: Singing
Knowledge List:
• People play guitar while singing.
• Playing guitar is an activity.
• Playing guitar is a hobby.
• People usually play guitar while singing.
• People play guitar to entertain others.
In this example, "People play guitar while singing" already reveals the correct answer to the model, yet its prediction remains wrong.

A.8.2 Ineffective Rumination
Question: What do people aim to do at work? Answer: Complete Job
Knowledge List:
• People work to earn money.
• People aim to earn money.
• People aim to get their work done.
• People aim to do their work.
• People aim to do their job well.
In this example, "people work to earn money" has nothing to do with "Complete Job", while the third one in the list, "People aim to get their work done.",conveys the meaning of "Complete Job".

A.8.3 Incorrect Memory
Question: Where can a human find clothes that aren't pants? Answer: Dress Shop
Knowledge List:
• A human can find clothes that aren't pants at the beach.
• Pants are a type of clothing.
• Clothes that aren't pants are dresses and skirts.
• Pants are the most common type of clothing.
• Pants are not the only type of clothing.
In this example, "A human can find clothes that aren't pants at the beach" is incorrect information for the question.

Figure 1: A pilot experimental case for the motivation of knowledge rumination. The PLM succeeds in probing the related commonsense knowledge but fails to obtain the answer with finetuning.

Figure 2: An illustration of different methodologies for utilizing the PLM. (a) standard finetuning, (b) prompt learning, and (c) the proposed knowledge rumination method. During knowledge reviewing with task-guided prompting (§3.2), the model parameters are frozen. h_[MASK] (the hidden vector of "[MASK]") is the elicited latent knowledge, which is injected into the FFNs for knowledge consolidation (§3.3).
[Figure 3 text] Question: "Who is likely to have a caring heart?" Ruminated knowledge: "As far as I know, he is still a good person. He helps victims and people ... always shows huge concern for his friends."

Figure 3: Case study of knowledge rumination on the CommonsenseQA dataset. Given the question, we map the '[MASK]' token into the vocabulary space, the pre-training corpus space, and the external corpus space. We observe that successful knowledge rumination always contains information similar to the answer.

Figure 4: Results of GPT-3 on commonsense reasoning datasets. Baseline refers to GPT-3 answering the question directly with few-shot demonstrations.

Table 1: Performance on the commonsense reasoning tasks. Scores of methods marked with * are taken from Yasunaga et al. (2022). As the official test sets for CSQA, PIQA, and HellaSwag are hidden, we report the in-house Dev (IHdev) and Test (IHtest) accuracy, following the data split of Yasunaga et al. (2022).

Table 3: OOD results. Performance (accuracy) of the compared methods, which are first trained on a source dataset and then directly conduct prediction on a target dataset (denoted as source ⇒ target).

Table 5: Error analysis on OBQA and CSQA.

Table 6: Statistics of the retrieved corpora.