Crystal: Introspective Reasoners Reinforced with Self-Feedback

Extensive work has shown that the performance and interpretability of commonsense reasoning can be improved via knowledge-augmented reasoning methods, where the knowledge that underpins the reasoning process is explicitly verbalized and utilized. However, existing implementations, including "chain-of-thought" and its variants, fall short in capturing the introspective nature of knowledge required in commonsense reasoning, and in accounting for the mutual adaptation between the generation and utilization of knowledge. We propose a novel method to develop an introspective commonsense reasoner, Crystal. To tackle commonsense problems, it first introspects for knowledge statements related to the given question, and subsequently makes an informed prediction that is grounded in the previously introspected knowledge. The knowledge introspection and knowledge-grounded reasoning modes of the model are tuned via reinforcement learning to mutually adapt, where the reward derives from the feedback given by the model itself. Experiments show that Crystal significantly outperforms both the standard supervised finetuning and chain-of-thought distilled methods, and enhances the transparency of the commonsense reasoning process. Our work ultimately validates the feasibility and potential of reinforcing a neural model with self-feedback.


Introduction
Commonsense reasoning poses unique challenges to neural reasoning models. The underlying knowledge that grounds the reasoning process is often obscure and inexplicable, even to humans, as we mainly rely on intuitive inference for such problems (Mercier and Sperber, 2017). This is in stark contrast with multi-step logical reasoning (e.g., math problems, logical deductions), where the reasoning process consists of explicit and closed-world deduction steps. Chain-of-thought (CoT) (Wei et al., 2022) and its variants have been successful in multi-step logical reasoning, yet their effectiveness on commonsense reasoning is marginal, largely because this observation is overlooked when designing their few-shot prompts. Nevertheless, generating the reasoning process is still instrumental for commonsense reasoning, as it improves both the performance and interpretability of neural models (Shwartz et al., 2020; Liu et al., 2022a, i.a.). For such a knowledge-augmented reasoning approach, two components are indispensable: (1) introspecting for relevant, high-quality knowledge, and (2) effectively and faithfully utilizing the knowledge to make informed final predictions.
Our key insight is that these two components are deeply adaptive to each other: knowledge introspection should aim to produce knowledge that would be most beneficial to grounding the subsequent reasoning, and knowledge-grounded reasoning should learn to best leverage the previously introspected knowledge. Existing literature does not comprehensively optimize these two components and the bi-directional interaction between them, and comes with additional complications. Knowledge-augmented reasoning methods largely employ task-specific engineering for knowledge generation and are thus difficult to generalize to unseen domains. As for CoT and its variants, the reasoning chains hardly provide meaningful information due to deficiency in their prompt design (Figure 1).
We aim to systematically address the above considerations and build a strong, interpretable and generalizable model for commonsense reasoning. The introspective reasoner we develop, named CRYSTAL, tackles commonsense problems by the following (illustrated in Figure 1): it first invokes a knowledge introspection mode to generate knowledge statements related to the given question, and subsequently invokes a knowledge-grounded reasoning mode that ingests both the question and the previously introspected knowledge to predict an answer. CRYSTAL is trained with reinforcement learning (RL) to improve the synergy between the reasoning paths and the final predictions. The knowledge introspection mode of the model is trained with PPO (Schulman et al., 2017) to optimize a reward function that characterizes if the generated knowledge can fix prediction errors made by the knowledge-grounded reasoning mode of the model. In this sense, CRYSTAL is reinforced with self-generated feedback. Concurrently, the knowledge-grounded reasoning mode evolves to better utilize the introspected knowledge statements for more accurate predictions. These two learning objectives are harmonized through a novel interleaved optimization schedule, echoing the principles of the EM algorithm (Dempster et al., 1977). We employ a two-stage training process: the RL training stage is preceded by a supervised training stage, where CRYSTAL acquires preliminary skills to generate and utilize knowledge by imitating a larger LM (e.g., GPT-3).
Experimental results on 25 commonsense QA benchmarks (10 seen, 15 unseen) show that CRYSTAL not only enhances performance within fixed model sizes, but also amplifies the interpretability of the reasoning process. CRYSTAL outperforms direct QA models finetuned with standard supervised learning on the same data, improving absolute accuracy by 1.5% to 2.5% across model sizes, and showcases good generalization to unseen benchmarks. This highlights the benefits of introspective reasoning over direct inference. Additionally, CRYSTAL substantially outperforms models distilled from CoT produced by large LMs. Through CRYSTAL, we illustrate the potential and viability of reinforcing neural reasoning models with self-feedback. An additional benefit of our approach is the memory- and time-efficient implementation of PPO via model sharing, which allows this state-of-the-art RL algorithm to be applied to larger models with a given amount of resources.

Method
We will first introduce the concept of introspective reasoning (§2.1), followed by a description of our introspective reasoner, CRYSTAL, including its basic functionality (§2.2), training objectives (§2.3), adaptation of the RL algorithm and efficiency improvements (§2.4), and the design of the model training process (§2.5, §2.6).

Introspective Reasoning
Conventionally, commonsense QA models are designed to directly predict answers for questions (e.g., Lourie et al., 2021). These models operate like black boxes and their predictions are difficult to interpret. We consider introspective reasoning, where a system first introspects for commonsense knowledge statements that are relevant to reasoning about the given question, and subsequently makes an informed prediction that is grounded in the introspected knowledge. We refer to the former step as knowledge introspection, and the latter step as knowledge-grounded reasoning.
Figure 1 exemplifies the introspective reasoning process. Given the question "What comes from cows?" with the correct answer "nutritious fluids" provided among other incorrect choices, the system first generates knowledge statements like "Cows produce milk." Taking this knowledge statement as additional input, the system then invokes knowledge-grounded reasoning and makes a correct prediction.
Introspective reasoning has promise in enhancing model performance on commonsense reasoning while making the reasoning process more interpretable.The introspected knowledge reveals the

CRYSTAL
CRYSTAL, the introspective reasoner that we develop, is a unified model that supports the end-to-end workflow of an introspective reasoning system. CRYSTAL has two modes of operation: knowledge introspection and knowledge-grounded reasoning. In knowledge introspection, the model accepts a question as input, and outputs a knowledge statement relevant to the question. In knowledge-grounded reasoning, the model ingests both the question and the previously generated knowledge statement as input, and outputs a prediction. The system produces a final prediction by consulting and aggregating the predictions resulting from all the available reasoning paths, effectively harnessing the power of ensembled reasoning.
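The two-mode workflow can be sketched as follows. Here `generate` is a stand-in for sampling from the underlying seq2seq model, the prompt prefixes are placeholders rather than the exact I/O format (which is described below), and the majority vote is one simple way to aggregate reasoning paths:

```python
def introspective_reason(generate, question, num_samples=3):
    """Two-pass inference sketch: introspect for knowledge, then reason.

    `generate(prompt)` stands in for sampling from the seq2seq model;
    the prompt prefixes below are illustrative placeholders.
    """
    # Mode 1: knowledge introspection -- sample several knowledge statements.
    knowledges = [generate("generate knowledge: " + question)
                  for _ in range(num_samples)]
    # Mode 2: knowledge-grounded reasoning -- one prediction per statement.
    predictions = [generate("answer: " + question + " knowledge: " + k)
                   for k in knowledges]
    # Final answer: aggregate (majority vote) over all reasoning paths.
    return max(set(predictions), key=predictions.count)
```

A toy stand-in for `generate` that always introspects "Cows produce milk." and then answers "nutritious fluid" would make this function return "nutritious fluid" for the running example.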
I/O format. Since CRYSTAL has two modes of operation, it needs to discern when to conduct knowledge introspection and when to engage in knowledge-grounded reasoning. Drawing inspiration from Tafjord and Clark (2021), we structure the input/output format as demonstrated in Table 1. This format is adapted from UnifiedQA (Khashabi et al., 2020), and we detail our modifications in §A.
Notation. CRYSTAL is a sequence-to-sequence generative model parameterized by θ. In knowledge introspection, the modeling of knowledge k given question q is denoted as p_QK(k|q; θ); in knowledge-grounded reasoning, the modeling of answer prediction a is denoted as p_QKA(a|q, k; θ).

Training Objectives
To yield the desired outcome of an introspective reasoning system, we need to make knowledge introspection and knowledge-grounded reasoning well-adapted to each other. The knowledge introspection component should aim to produce knowledge that would be most beneficial to ground the subsequent reasoning, and the knowledge-grounded reasoning component should learn to best leverage the previously introspected knowledge. We design the training objectives to account for this mutual adaptation.
Adapting reasoning to introspection. Suppose a knowledge statement is sampled online from CRYSTAL in the knowledge introspection mode: k ∼ p_QK(k|q; θ). We use standard supervision and minimize a knowledge-grounded reasoning loss, L_QKA(θ) = −log p_QKA(a*|q, k; θ), where a* is the correct answer for question q.
Adapting introspection to reasoning. We consider an introspected knowledge statement good if it steers the knowledge-grounded reasoning mode toward the correct answer, i.e., argmax_{a∈A} p_QKA(a|q, k; θ) = a* while argmax_{a∈A} p_QKA(a|q, ε; θ) ≠ a*, where A is the candidate set for question q, and ε stands for no knowledge; and vice versa for a bad knowledge statement. However, a knowledge statement consists of a sequence of discrete tokens, rendering standard gradient methods infeasible for optimizing the introspected knowledge. Following Liu et al. (2022a), we formulate the problem as reinforcement learning (RL), and optimize a reward function that characterizes the desirability of knowledge: r(q, k) = ½ [tanh(s(a*|q, k) − max_{a≠a*} s(a|q, k)) − tanh(s(a*|q, ε) − max_{a≠a*} s(a|q, ε))], where s(a|q, k) is the pre-softmax logit of p_QKA(a|q, k; θ) on the single-token answer a. The reward approaches +1 for good knowledge statements and −1 for bad ones.
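Assuming the reward takes the tanh-of-margin form used for Rainier (Liu et al., 2022a), it can be sketched as follows; the function name and the dict-based interface are illustrative, not the released implementation:

```python
import math

def knowledge_reward(logits_with_k, logits_no_k, correct):
    """Sketch of a knowledge-desirability reward (following Rainier,
    Liu et al. 2022a; the exact scaling in the paper may differ).

    `logits_with_k` / `logits_no_k`: dicts mapping each candidate answer
    to its pre-softmax logit s(a|q,k) from the knowledge-grounded
    reasoning mode, with and without the knowledge statement.
    """
    def margin(logits):
        # Margin of the correct answer over the best wrong answer.
        best_wrong = max(v for a, v in logits.items() if a != correct)
        return logits[correct] - best_wrong

    # Approaches +1 when knowledge fixes an error, -1 when it causes one.
    return 0.5 * (math.tanh(margin(logits_with_k))
                  - math.tanh(margin(logits_no_k)))
```

With confident logits, knowledge that flips a wrong prediction to the right one yields a reward near +1, and knowledge that flips a right prediction to a wrong one yields a reward near −1.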
We use the PPO algorithm to optimize this reward.
A knowledge introspection loss L_PPO(θ) can be defined as a function of the reward and the model parameters θ, following Liu et al. (2022a). Since this training loss is derived from the downstream knowledge-grounded reasoning results produced by the same model, the model is reinforced with feedback given by itself.
During training, the two objectives, L_PPO(θ) and L_QKA(θ), are optimized under an interleaved schedule (rather than jointly), which is described in §2.6.
Leaving out the direct QA objective. To prevent the model from taking reasoning shortcuts and encourage it to leverage the introspected knowledge, we deliberately leave out a potential, direct QA objective, L_QA(θ) = −log p_QKA(a*|q, ε; θ). As we will show in experiments, including this direct QA loss hurts performance, probably by allowing the model to take shortcuts around the knowledge.

PPO and Model Sharing
PPO, or Proximal Policy Optimization (Schulman et al., 2017), is an RL algorithm that has been widely used in aligning LMs with human feedback (Stiennon et al., 2020; Ouyang et al., 2022; OpenAI, 2022; Wu et al., 2023). It is also adopted by Liu et al. (2022a) to train their knowledge introspection model, Rainier. Within the context of PPO terminology, CRYSTAL's knowledge introspection mode assumes the role of the policy model, while its knowledge-grounded reasoning mode functions as the reward model. PPO further employs a value model to estimate the value function for states containing partial knowledge statements, and we propose to reuse the parameters of CRYSTAL for the value model as well. Consequently, while in conventional PPO the policy, value and reward models are parameterized separately, when training CRYSTAL they share the same underlying parameters. CRYSTAL is essentially a generative LM equipped with two heads: an LM head that comes into play during policy and reward modeling, and a value regression head that is activated in value modeling. This model sharing results in improved memory and time efficiency for PPO training, as discussed in §4.4.
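The parameter sharing can be pictured schematically as one backbone with two heads; the class, method, and attribute names below are illustrative placeholders, not the released implementation:

```python
class SharedCrystalModel:
    """Schematic of CRYSTAL's parameter sharing in PPO (structure only).

    One shared backbone serves the policy, reward, and value roles;
    only the output head differs between roles.
    """

    def __init__(self, backbone, lm_head, value_head):
        self.backbone = backbone      # shared seq2seq parameters
        self.lm_head = lm_head        # used for policy (knowledge generation)
                                      # and reward (answer logits)
        self.value_head = value_head  # scalar value estimates for PPO

    def policy_logits(self, question_tokens):
        # Policy role: next-token logits for knowledge introspection.
        return self.lm_head(self.backbone(question_tokens))

    def reward_logits(self, question_and_knowledge_tokens):
        # Reward role: answer logits from knowledge-grounded reasoning.
        return self.lm_head(self.backbone(question_and_knowledge_tokens))

    def value(self, partial_knowledge_tokens):
        # Value role: scalar estimate for a partially generated statement.
        return self.value_head(self.backbone(partial_knowledge_tokens))
```

Because all three roles route through the same `backbone`, only one set of transformer parameters is stored and updated, which is the source of the memory and time savings analyzed in §4.4.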

Two-Staged Training
PPO requires that the policy model is initialized from a reasonably good state. Typically, PPO training follows a supervised finetuning stage for the policy model (Stiennon et al., 2020; Ouyang et al., 2022). For Rainier, an imitation learning stage, during which the model is supervised on silver knowledge statements obtained from a few-shot GPT-3, precedes the RL training stage. This imitation learning stage imparts the model with the preliminary skill of generating question-specific knowledge, and sets a promising starting point for RL.
Drawing from this concept, we employ a two-stage training process for CRYSTAL. In training stage I, the model is tuned to conduct both knowledge introspection (by imitating a few-shot GPT-3) and knowledge-grounded reasoning. We minimize two losses: a knowledge introspection loss, L_QK(θ) = −log p_QK(k|q; θ), and a knowledge-grounded reasoning loss, L_QKA(θ) = −log p_QKA(a*|q, k; θ), where k is a silver knowledge statement generated by the few-shot GPT-3. In training stage II, we follow the procedure in §2.3 to adapt the knowledge introspection and knowledge-grounded reasoning modes to each other.

Interleaved Optimization Schedule
Through empirical analysis, we have observed that interleaving the two training losses in each stage yields better outcomes than jointly optimizing them. In stage I, we optimize L_QK for a specific number of iterations, followed by optimizing L_QKA for another set number of iterations, repeating this cycle. Similarly, in stage II, we optimize L_PPO for a designated number of iterations, followed by optimizing L_QKA for another set number of iterations, repeating this cycle. This design bears resemblance to the EM algorithm (Dempster et al., 1977), wherein the hidden variable corresponds to the knowledge statement. Optimizing L_PPO can be likened to estimating the hidden variables, while optimizing L_QKA is akin to updating the parameter estimation based on the current assignment of hidden variables. The interleaved optimization schedule is illustrated in Figure 2.
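The schedule can be sketched as a flat sequence of per-iteration loss choices; the function and the hyperparameter counts passed to it are illustrative:

```python
def interleaved_schedule(first_loss, second_loss, s_first, s_second, cycles):
    """Sequence of losses to optimize, alternating within each cycle.

    Sketch of the interleaved schedule: in stage I one would call
    interleaved_schedule("L_QK", "L_QKA", S_QK, S_QKA, n), and in
    stage II interleaved_schedule("L_PPO", "L_QKA", S_PPO, S_QKA, n),
    where the S_* counts are hyperparameters.
    """
    schedule = []
    for _ in range(cycles):
        # Roughly the E-step: re-estimate the "hidden" knowledge statements.
        schedule += [first_loss] * s_first
        # Roughly the M-step: update reasoning given the current knowledge.
        schedule += [second_loss] * s_second
    return schedule
```

Joint optimization would correspond instead to minimizing the sum of both losses at every iteration, which, as the ablations show, performs worse.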

Models. Similar to
Baselines. We primarily compare CRYSTAL with models trained on the same datasets using the standard QA objective (i.e., Equation 1), referred to as "Direct QA". These models are also based on the pretrained T5 of the three sizes above. Additionally, we compare our model to Rainier (Liu et al., 2022a) and several CoT-distilled models. Among these, Fine-tune-CoT (Ho et al., 2022) uses zero-shot reasoning chains elicited from text-davinci-002 to finetune smaller variants of GPT-3; SCoTD (Li et al., 2023a) employs a similar distillation strategy, whereas the teacher model is code-davinci-002 and the target model is OPT with up to 1.3B parameters; SCOTT (Wang et al., 2023) proposes additional techniques to improve reasoning integrity, including a contrastive decoding method to elicit more consistent reasoning chains from the teacher model and a counterfactual reasoning method to train the target model, yet it does not enable full bidirectional adaptation between the reasoning process and the final prediction, as our method does. It is worth noting that these CoT-distilled models are often trained on specific datasets, so we only present their reported performance on the datasets they were trained on. We also report the existing SOTA performance achieved by non-retrieval methods on each seen dataset.2 We exclude retrieval-based methods for fair comparison, because CRYSTAL does not rely on retrieval from extra sources.

Performance
The performance results are presented in Table 2 and Table 3. We organize the results based on the size of the models and compare them to baseline models that are no smaller than our models.
On seen datasets, across all model sizes we experiment with, CRYSTAL consistently outperforms the direct QA baseline (by 1.5%∼2.5% depending on model size). This demonstrates that our training process is superior to standard supervised training and brings substantial performance gains to the model. CRYSTAL also outperforms the combination of Rainier and UnifiedQA, especially over the last five datasets (which UnifiedQA is not trained on). This shows the benefit of adapting knowledge-grounded reasoning to the introspected knowledge.
CRYSTAL performs very closely to existing non-retrieval SOTA methods, setting new SOTA on two datasets (CommonsenseQA, QASC), and has less than a 3% gap on the other four (OpenBookQA, PIQA, SIQA, Winogrande). It is worth noting that these SOTA methods excel on different datasets, whereas CRYSTAL is a single model with strong performance on all these benchmarks. CRYSTAL is also competitive when compared with CoT-distilled models of similar sizes. The large and 3b versions of CRYSTAL beat Fine-tune-CoT on CommonsenseQA by 27% and 23%, respectively, despite having smaller model sizes. CRYSTAL-large is comparable to SCoTD on OpenBookQA and CommonsenseQA, and CRYSTAL-3b significantly outperforms SCOTT on CommonsenseQA (by 5%) and QASC (by 9%).
Being trained on multiple commonsense datasets, CRYSTAL exhibits good generalization to unseen datasets. As shown in Table 3, CRYSTAL achieves a 2.0%∼4.3% average accuracy improvement over the direct QA baseline on the unseen evaluation benchmarks. The largest version of our model, CRYSTAL-11b, achieves an average accuracy of over 80% on these benchmarks.

Interpretability
Table 4: Examples of CRYSTAL's introspected knowledge and predictions grounded in the knowledge. The first row of each section is the original question and the prediction made by the direct QA model; the second row is the knowledge statement generated by CRYSTAL in the knowledge introspection mode, and the prediction made by CRYSTAL under knowledge-grounded reasoning with this knowledge statement. We show correct answers in green and incorrect answers in red.

Besides QA accuracy, we measure whether the introspective reasoning conducted by CRYSTAL provides good interpretability for its reasoning process. We asked three NLP experts to annotate the relationship between the introspected knowledge and the final prediction. We randomly selected 100 examples (four from each of the 25 datasets, including both seen and unseen ones), and each annotator made a full pass over them. For each example, the annotator chooses one of the following labels:
• Support: The knowledge can be part of a non-trivial reasoning chain that supports the predicted answer.
• Trivial: The knowledge is a trivial paraphrase of the question and the predicted answer.
• Repeat: The knowledge is a mere repetition of known information given in the question.
• Related: The knowledge is topically related to the question and/or the choices, but cannot be part of a reasoning chain to support or refute any of the choices.
• Unrelated: The knowledge is unrelated to the question.
• Contradict: The knowledge can be part of a reasoning chain that refutes the predicted answer, or supports a different choice.
See Table 11 (appendix) for a detailed description of these labels and some examples.
The annotators reached a moderate level of agreement (Fleiss κ = 0.53 (Landis and Koch, 1977)). As shown in Figure 3, in 34% of the cases the introspected knowledge is found to support the final prediction in a non-trivial manner. In 19% of the cases the knowledge trivially entails the prediction, and in another 31% of the cases the knowledge is otherwise related to the question. The knowledge repeats known information in the question 5% of the time, and is unrelated to the question or contradicts the prediction 11% of the time. Overall, the reasoning process has good interpretability in the majority of cases.

Qualitative Examples
We present several examples in Table 4 to illustrate the reasoning process of CRYSTAL. In most cases, the introspective reasoning carried out by CRYSTAL leads to more accurate predictions compared to the direct QA model. The knowledge introspected by CRYSTAL often proves to be beneficial in arriving at the correct prediction from a human interpretation standpoint. For example, the knowledge "Cows produce milk" aids in concluding that "Nutritious fluid comes from cows" (with the implicit knowledge that "Milk is nutritious fluid"). This showcases how the knowledge-grounded reasoning of CRYSTAL leverages introspected knowledge to reach accurate predictions. However, there are exceptional cases where the knowledge-grounded reasoning fails to incorporate the introspected knowledge. For example, the correct knowledge that "An anemometer measures wind speed and direction" is introspected, but CRYSTAL still predicts "air pressure" instead of "wind speed" as the thing measured by anemometers.

Memory and Time Efficiency
As mentioned in §2.4, the PPO training in CRYSTAL improves the efficiency of the conventional PPO algorithm by model sharing. In this section, through theoretical and empirical analysis, we compare the memory and time consumption of PPO training in CRYSTAL and Rainier (Liu et al., 2022a), which employs conventional PPO. PPO in Rainier requires three different models: a policy model, a value model, and a reward model. The policy model is Rainier, while the reward model is a fixed QA model. The value model is a separate model that shares the same architecture as Rainier, with the exception that it has a value regression head instead of a language modeling head. The policy and value models are trained simultaneously, while the reward model remains frozen. Additionally, an initial version of the policy model must be retained (to calculate the KL penalty term). Therefore, a total of four models are stored, with two of them being actively updated. In each PPO iteration, Rainier requires 5 + 2s forward passes, 2s backward passes, and 2s optimizer updates on the model (s is the number of minor steps in each PPO iteration). This involves executing a gradient-less rollout from the policy, one gradient-less forward pass on the value model, another gradient-less forward pass on the initial policy model, two gradient-less forward passes on the reward model, and, for each minor step in the iteration, one forward-backward pass and one optimizer update on the policy model and the value model, respectively.
In contrast to Rainier, PPO on CRYSTAL needs to store only two models: a shared policy/value/reward model which is being actively updated, and an initial version of the policy model (to compute the KL penalty term). In each PPO iteration, CRYSTAL needs 4 + s forward passes, s backward passes, and s optimizer updates on the model: a gradient-less rollout from the policy, one gradient-less forward pass on the initial policy model, two gradient-less forward passes on the reward model, plus, for each minor step in the iteration, one forward-backward pass and one optimizer update on the policy/value model.
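The per-iteration counts for both setups can be restated as a small helper (a direct transcription of the arithmetic above, with illustrative names):

```python
def ppo_passes(shared, s):
    """Per-PPO-iteration counts of forward passes, backward passes, and
    optimizer updates, as derived in the text.

    `shared=True`: CRYSTAL-style model sharing (one policy/value/reward
    model). `shared=False`: the conventional separate-model setup used
    in Rainier. `s` is the number of minor steps per PPO iteration.
    """
    if shared:
        # Rollout + 1 initial-policy pass + 2 reward passes, then one
        # forward-backward + update per minor step on the shared model.
        return {"forward": 4 + s, "backward": s, "updates": s}
    # Rollout + value pass + initial-policy pass + 2 reward passes, then
    # forward-backward + update per minor step on policy AND value models.
    return {"forward": 5 + 2 * s, "backward": 2 * s, "updates": 2 * s}
```

For the experimental setting of s = 4, model sharing cuts forward passes from 13 to 8 and halves the backward passes and optimizer updates.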
Table 5 summarizes the theoretical memory and time consumption of Rainier and CRYSTAL, and Table 6 reports the empirical memory usage and speed in the stage II training of these models. Compared with Rainier, CRYSTAL has fewer trainable parameters, consumes less GPU memory, and trains faster. The superior memory and time efficiency of CRYSTAL enables training larger models: an 11b model can be reinforced with 64 V100 GPUs.

Ablations
The effect of RL. We report the impact of removing RL (i.e., training stage II) in Table 7. Across different model sizes, the performance of CRYSTAL on seen datasets consistently decreases by approximately 0.5% to 0.6% when training stage II is omitted. This highlights the significance of RL in enhancing the knowledge introspection and knowledge-grounded reasoning capabilities of CRYSTAL.

Impact of the direct QA loss. We experimented with training the stage I model with the addition of the direct QA loss (§2.3, Equation 1). As shown in Table 7, training with this additional loss hurts performance by 0.8%. We therefore did not include this loss in the training objective of CRYSTAL.
Interleaved objectives. To demonstrate the advantages of interleaving the training objectives, we explore an alternative approach using a joint objective. In this approach, during training stage I, we optimize the joint loss, L_QK + L_QKA, in each iteration. Similarly, during training stage II, we optimize the joint loss, L_PPO + L_QKA. Table 8 presents the results of this approach, where the interleaved objectives are replaced with the joint version. Under the joint objective, the performance on seen datasets decreases. This suggests that the interleaving of objectives in CRYSTAL provides a benefit over joint optimization.
Relation to chain-of-thought distillation. A series of works endow smaller LMs with step-by-step reasoning capabilities by distilling from chain-of-thought (CoT) reasoning generated by large LMs (Li et al., 2022; Shridhar et al., 2022; Magister et al., 2022; Ho et al., 2022; Fu et al., 2023; Li et al., 2023a; Wang et al., 2023). We share similarity with this line of work in that part of our training process (i.e., training stage I) includes distilling the emergent capability of a larger LM into a smaller one. We differ in that we capture the introspective nature of knowledge required for commonsense reasoning, and we further use reinforcement learning to improve the synergy between reasoning paths and final answer predictions.
Improving from self-feedback. Several papers have proposed to improve LMs using feedback from themselves. For example, Zelikman et al. (2022) propose to train a model on its self-generated reasoning steps that lead the model to correct final predictions. Huang et al. (2022) choose which self-generated reasoning chains to train on by selecting the high-confidence, self-consistent ones. Both papers use a supervised loss to improve the model. To the best of our knowledge, we are the first to improve models from self-feedback using RL. Concurrent to our work, Madaan et al. (2023) propose an inference-time method to improve text generation by taking an LM's own feedback on the output, yet it relies on the emergent behavior of LLMs, whereas CRYSTAL improves through RL and can be applied to smaller LMs to achieve higher performance than larger LLMs.

Conclusion
We develop a method to build introspective reasoners that achieve superior performance and good interpretability on commonsense reasoning tasks.
Compared with prior literature, our method comprehensively accounts for the introspective nature of knowledge required in commonsense reasoning, and the mutual adaptation of knowledge introspection and knowledge-grounded reasoning. Our approach highlights the feasibility and benefit of training neural models with self-feedback.

Limitations
CRYSTAL is intended to solve commonsense QA problems, and its performance on non-commonsense applications is unknown and thus requires further investigation. There is also a limit on the length of the knowledge it generates in our experimental setting, and it has not been tested on generating long and coherent text. Extra care should be taken when applying our model in production environments, especially when making critical decisions or exposing its generated contents directly to human end users.

Figure 1 :
Figure 1: Top: CRYSTAL performing introspective reasoning on a commonsense question. The model first uses its knowledge introspection mode to generate relevant knowledge statements, then invokes a knowledge-grounded reasoning mode to predict an answer based on the introspected knowledge. Bottom: chain-of-thought prompting on the same question (generated by text-davinci-003 with the original few-shot prompts in Wei et al. (2022)). The intermediate steps fail to provide meaningful insight into the reasoning process.

Figure 2 :
Figure 2: Illustration of the interleaved optimization schedule for both training stages. In training stage I, during each cycle, L_QK is optimized for S_QK iterations, and then L_QKA is optimized for S_QKA iterations. Progressing to training stage II, during each cycle, L_PPO is optimized for S_PPO iterations, and then L_QKA is optimized for S_QKA iterations.

Figure 3 :
Figure 3: Expert annotation on the relationship between the introspected knowledge and the final prediction.
• …company's budget at the business meeting but the _ was boring and the topic of the budget ran long. (A) budget (B) meeting. Direct QA: (A). Knowledge: "The topic of the meeting was boring." CRYSTAL: (B).
• PIQA: Find spices easily in the kitchen. (A) Arrange spices from hot to mild in the kitchen in order to find them by taste. (B) Arrange your spices alphabetically to make finding them easy. Direct QA: (A). Knowledge: "A spice alphabet is used to find spices." CRYSTAL: (B).
• QASC: What comes from cows? (A) pork (B) can be organic (C) holding nutrients (D) drinking water (E) rice (F) antigens (G) nutritious fluid (H) corn. Direct QA: (A). Knowledge: "Cows produce milk." CRYSTAL: (G).
• CSQA: Paul wants carrots and doesn't need to drive anywhere. He gets them from where? (A) refrigerator (B) store (C) farmer's market (D) supermarket (E) dryer. Direct QA: (D). Knowledge: "Carrots are stored in the refrigerator." CRYSTAL: (A).
• OBQA: Frilled sharks and angler fish live far beneath the surface of the ocean, which is why they are known as (A) Deep sea animals (B) fish (C) Long Sea Fish (D) Far Sea Animals. Direct QA: (D). Knowledge: "Deep sea animals are found in the ocean." CRYSTAL: (A).
• ARC_e: An anemometer is a tool that measures (A) wind direction. (B) wind speed. (C) air pressure. (D) air temperature. Direct QA: (B). Knowledge: "An anemometer measures wind speed and direction." CRYSTAL: (C).

• Unrelated: The knowledge is unrelated to the question. Example: Question: How are the particles in a block of iron affected when the block is melted? (A) The particles gain mass. (B) The particles contain less energy. (C) The particles move more rapidly. (D) The particles increase in volume. Knowledge: "Iron particles are affected by heat." Prediction: (C).
• Contradict: The knowledge can be part of a reasoning chain that refutes the predicted answer, or supports a different choice. Example: Question: I need what to calculate the length from my big toe to my little toe? (A) Calculator (B) Tape Measure (C) A Graph (D) A Microscope. Knowledge: "A calculator is used to calculate lengths." Prediction: (B).

Table 3 :
Results on unseen datasets. Accuracy on the development set is reported.

Table 5 :
Theoretical memory and time consumption of PPO training in CRYSTAL. s is the number of minor steps in each PPO iteration. (In our experiments, we use s = 4.)

Table 6 :
Empirical memory usage and speed of training CRYSTAL (stage II). Experiments are conducted on V100 GPUs. Fully sharded data parallel (FSDP) and bfloat16 mixed precision are enabled.

Table 7 :
Ablations on the RL training stage (i.e., stage II). Average accuracy on the development set of seen datasets is reported.

Table 8 :
Ablations on the interleaved training objectives. Average accuracy on the development set of seen datasets is reported.

Table 10 :
Hyperparameter settings.

• Question: Who watches a play in an auditorium? (A) building (B) crowd (C) city (D) group (E) high school. Knowledge: "Audiences watch plays in auditoriums."
• Question: The movement of crustal plates results from circulating currents in material beneath the crust of Earth. Which best describes the material which moves the crustal plates? (A) hot water (B) molten rock (C) liquid metal (D) solid iron. Knowledge: "The movement of crustal plates is caused by circulating currents in material beneath the crust of Earth."

Table 11 :
Instruction for the human evaluation.

Table 12 :
Comparison of existing knowledge-augmented reasoning methods. Unified model: whether the method employs a unified model (rather than separate models) for knowledge generation and knowledge-grounded reasoning. KG => KR: whether the knowledge-grounded reasoning is trained to adapt to knowledge generation. KR => KG: whether the knowledge generation is trained to adapt to knowledge-grounded reasoning.