Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering

Knowledge underpins reasoning. Recent research demonstrates that when relevant knowledge is provided as additional context for commonsense question answering (QA), it can substantially enhance performance even on top of state-of-the-art models. The fundamental challenge is where and how to find knowledge that is high quality and on point with respect to the question: knowledge retrieved from knowledge bases is incomplete, and knowledge generated by language models is inconsistent. We present Rainier, or Reinforced Knowledge Introspector, which learns to generate contextually relevant knowledge in response to given questions. Our approach starts by imitating knowledge generated by GPT-3, then learns to generate its own knowledge via reinforcement learning, where rewards are shaped based on the resulting gains in question answering performance. Rainier demonstrates substantial and consistent performance gains when tested over 9 different commonsense benchmarks: 5 datasets that are seen during model training, as well as 4 datasets that are kept unseen. Our work is the first to report that knowledge generated by models that are orders of magnitude smaller than GPT-3, even without direct supervision on the knowledge itself, can exceed the quality of commonsense knowledge elicited from GPT-3.


Introduction
Commonsense is a significant challenge for modern NLP models, due to the obscurity of the underlying knowledge that grounds the reasoning process. While humans are generally able to introspect the underlying reasons for their conclusions (Mercier and Sperber, 2017), neural models lack the capability to verbalize the premises leading to their predictions. This hinders models' performance and robustness on commonsense tasks, and makes it difficult to inspect their point of failure.

Figure 1: RAINIER can introspect for commonsense knowledge that underpins the reasoning process, and is trained via reinforcement learning, where the reward is derived from the effectiveness of the knowledge when prompting a frozen, generic QA model.

Recent research has demonstrated that relevant knowledge can provide useful context for approaching commonsense tasks. Yet these methods either retrieve from in-domain knowledge bases (Mitra et al., 2019; Chang et al., 2020) that do not have good coverage of commonsense, or generate knowledge from neural models (Shwartz et al., 2020; Gu et al., 2022; Liu et al., 2022), which often requires domain-specific engineering and very large models (e.g. GPT-3 (Brown et al., 2020)). It remains an open challenge to systematically find high-quality knowledge.
In this work, we use a novel, reinforcement-learning-based method to develop RAINIER, a generative neural model that can introspect the underlying knowledge for reasoning about commonsense questions. As illustrated in Figure 1, RAINIER is trained to generate knowledge statements that are both fluent natural language and useful prompts that optimize the performance of a generic question answering (QA) model. Our model (1) demonstrates strong generalization to unseen benchmarks without additional engineering effort (e.g. finetuning), (2) produces commonsense knowledge of high quality and diversity, and (3) is substantially smaller in size than GPT-3, the best knowledge source reported so far.
To train RAINIER, we optimize knowledge introspection for the resulting QA performance, rather than with direct supervision, because commonsense datasets usually carry no gold knowledge labels. To ensure that our model learns to generate generically useful knowledge for a broad range of QA models, we train only RAINIER, the knowledge introspector, without finetuning the QA model. Since the desired knowledge statements are sequences of discrete, non-differentiable word tokens, we adapt a reinforcement learning algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017; Ouyang et al., 2022), to optimize the knowledge introspector. Specifically, the reward is defined as the effect of RAINIER-generated knowledge on the QA model's prediction. We train RAINIER in a multitask setting on 8 commonsense QA datasets (encompassing general, scientific, physical, and social commonsense) to equip the model with better generalization to unseen benchmarks.
Experiments show that RAINIER substantially improves the performance of QA models on 9 commonsense benchmarks (5 datasets seen during training and 4 unseen datasets), and gives larger and more consistent gains than few-shot GPT-3 (Liu et al., 2022) despite being 16x smaller in parameter size. It also boosts the performance of QA models that it was not trained against, indicating that it generates generically useful knowledge rather than merely exploiting the reward given by a single QA model. Knowledge generated by RAINIER can even boost a QA model that is 4x larger than RAINIER itself, showing the promise of model-generated knowledge as a complement to model scaling for making progress in commonsense reasoning. Our analyses show that the knowledge generated by RAINIER is of high quality, and is diverse in terms of domain (e.g. scientific, social), relation expressed (e.g. part of, member of, purpose), and syntactic property (e.g. negation, comparison). The effect of this knowledge on the QA model also aligns well with human judgments. The success of RAINIER shows that moderately-sized models can serve as a source of high-quality and useful commonsense knowledge that facilitates reasoning. We publicly release the code, the trained RAINIER model, and the commonsense datasets extended with knowledge generated by RAINIER.

Problem Overview. We focus on multiple-choice commonsense QA tasks, consisting of instances of the format x = (q, A, a*), where q is the question, A is the set of candidate answers, and a* ∈ A is the correct answer. For full contextualization, we append the candidate answers A to the question q to form the input to the QA model, following the UnifiedQA input format described in the experimental setup.
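To make the input format concrete, the following is a minimal sketch of one plausible serialization, assuming a UnifiedQA-style template (which the experimental setup below says we follow); the exact template in the released code may differ in details such as casing or separators.

```python
def format_question(question: str, choices: list[str]) -> str:
    """Serialize a question and its candidate answers into one QA-model input.

    Assumes a UnifiedQA-style template (lower-cased text, choices labeled
    (a), (b), ...); the authors' exact template may differ in details.
    """
    labels = "abcdefghij"
    choice_str = " ".join(f"({labels[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} \n {choice_str}".lower()

# format_question("Where would you find a shark?", ["ocean", "desert", "forest"])
# -> "where would you find a shark? \n (a) ocean (b) desert (c) forest"
```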

Algorithm 1 Training RAINIER

Input: initial policy model θ0, initial value model ϕ0, pretrained QA model ψQA
Dimit ← Get silver knowledge on Dseen from GPT-3.
θimit ← Optimize θ0 with Eqn. 2 on Dimit.  ▷ Section 2.1
Common approaches train only supervised QA models. As a complement, we train a separate model, which we refer to as RAINIER, that can introspect question-specific knowledge useful for prompting a fixed QA model. RAINIER is a sequence-to-sequence language model, p_K(k | q; θ), and we expect it to generate knowledge statements (k's) in response to a given question (q). The challenge, however, is that we have no gold knowledge labels to use as supervision.
Training. Since we do not have gold knowledge to train RAINIER, we obtain this model by finetuning a pretrained language model in two stages: (I) imitation learning, and then (II) reinforcement learning. In Stage I (§2.1), we get silver knowledge labels on some datasets from GPT-3, and teach our model to imitate this knowledge-generating GPT-3. This equips our model with the basic functionality of knowledge generation. In Stage II (§2.2), we use reinforcement learning to continue training the model obtained in Stage I, making the generated knowledge more useful while keeping it fluent and meaningful. Specifically, we set the reward to be the effect of the generated knowledge on the prediction made by a fixed, generic QA model. We obtain silver knowledge and train RAINIER on the union of multiple QA datasets (which are considered seen during training), i.e. D_seen = D_1 ∪ ... ∪ D_{Δ_seen}, where Δ_seen is the number of seen datasets. The generic QA model we use may or may not have been trained on these seen datasets. The complete training process is outlined in Algorithm 1.
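For orientation, here is a hypothetical, high-level driver mirroring Algorithm 1; the helper callables (silver-knowledge sampling, supervised finetuning, and the PPO update) are placeholders for the components described in the following subsections, not the authors' actual interfaces.

```python
from typing import Callable, Iterable, List, Tuple

def train_rainier(
    questions: Iterable[str],
    get_silver_knowledge: Callable[[str], List[str]],             # GPT-3 sampling (Stage I)
    imitation_finetune: Callable[[List[Tuple[str, str]]], None],  # supervised loss, Eqn. 2
    ppo_update: Callable[[str], None],                            # one PPO step (Stage II)
) -> None:
    """High-level sketch of the two-stage pipeline, with hypothetical helpers."""
    # Stage I: collect (question, silver knowledge) pairs and imitate GPT-3.
    silver_pairs = [(q, k) for q in questions for k in get_silver_knowledge(q)]
    imitation_finetune(silver_pairs)

    # Stage II: reinforcement learning against the frozen QA model's reward.
    for question in questions:
        ppo_update(question)
```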
Inference. The effectiveness of RAINIER is evaluated on a set of unseen QA datasets, D_unseen, in addition to the seen datasets. Note that RAINIER is not trained on any unseen dataset: we neither collect silver knowledge on them, nor perform imitation learning or reinforcement learning with them. The generic QA model we use was not trained on any unseen dataset either. We discuss details of inference in §2.3.

Training Stage I: Imitation Learning
In Stage I, we train RAINIER so that it generates fluent and meaningful natural language statements that resemble knowledge. There is no large-scale commonsense QA dataset labeled with high-quality knowledge, but GPT-3 has been shown to be a good generator of relevant knowledge (Liu et al., 2022). Therefore, we get silver knowledge from GPT-3 on our seen datasets. Following Liu et al. (2022), we elicit question-related knowledge by prompting GPT-3 with a task-specific set of few-shot demonstrations, denoted prompt_d for dataset d (see §C for details on the prompts), and decoding M knowledge statements for each question:

k_m ~ p_G(k | prompt_d, q),  m = 1, ..., M,

where p_G(·|·) denotes GPT-3 with nucleus sampling at p = 0.5 (Holtzman et al., 2020). This yields a silver dataset of question-knowledge pairs:

D_imit = { (q, k_m) : (q, A, a*) ∈ D_seen, m = 1, ..., M }.

We then train RAINIER, starting from a pretrained sequence-to-sequence language model, on this silver dataset with the standard supervised loss:

L_imit(θ) = E_{(q,k) ~ D_imit} [ -log p_K(k | q; θ) ].    (2)

The parameterization of the resulting model is denoted θ_imit.
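As a hedged illustration, this Stage I objective reduces to ordinary teacher-forced cross-entropy of the silver knowledge given the question. The snippet below sketches one gradient step with Hugging Face Transformers, using t5-large as a stand-in initialization; the model name, learning rate, and example strings are illustrative, not the exact released setup.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One silver (question, knowledge) pair; in the paper the knowledge comes from GPT-3.
question = "where would you find a shark? \n (a) ocean (b) desert (c) forest"
silver_knowledge = "Sharks live in the ocean."

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(silver_knowledge, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # -log p_K(k | q; θ), averaged over tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```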

Training Stage II: Reinforcement Learning
As we will see in the empirical results, the imitation model obtained in Stage I does not provide the most beneficial knowledge. Therefore, in Stage II, we continue optimizing RAINIER to generate knowledge that best prompts the QA model, by directly maximizing the reward given by this QA model.
Knowledge generation as reinforcement learning. Since knowledge statements (k's) are discrete and thus non-differentiable, we adopt a reinforcement learning approach, and consider knowledge generation as a sequential decision-making process over the natural language vocabulary space.
We consider the generation of a knowledge statement k with T tokens as an episode of length T. At step t ∈ [1, T], the state s_t = (q, k_<t) is the combination of the question and the knowledge decoded up to the (t-1)-th token; the action a_t = k_t is the t-th token to decode. The RAINIER model, p_K(k_t | q, k_<t; θ), is the policy model that we optimize. We define a reward function r(x, k) that characterizes the effect of the knowledge on the QA model's prediction, and discuss its definition in §2.2.1.
To ensure that the generated knowledge stays fluent and meaningful, we would like the learned policy not to move too far from the initial imitation model. Therefore, we add to the reward an (approximate) KL penalty between the learned policy and the initial policy (Ouyang et al., 2022):

r̂(x, k) = r(x, k) - β log [ p_K(k | q; θ) / p_K(k | q; θ_imit) ].

Since this reward is computed from the full knowledge statement, we assign it to the last step of the episode; non-terminal steps are assigned zero rewards. Formally,

r_t = r̂(x, k) if t = T, and r_t = 0 otherwise.

We employ Proximal Policy Optimization (PPO) (Schulman et al., 2017) as our reinforcement learning algorithm, and adapt the implementation of PPO in Ouyang et al. (2022). Aside from the policy model, PPO additionally uses a value model (parameterized by ϕ) to estimate the value function for states with incompletely decoded text, i.e. V(s_t; ϕ) for any t. PPO minimizes a joint loss,

L_PPO(θ, ϕ) = L_Policy(θ) + α · L_Value(ϕ),

where L_Policy(θ) is the loss on the policy model, L_Value(ϕ) is the loss on the value model, and α is a hyperparameter.
Policy loss. To obtain the policy loss, we first compute the truncated estimated advantage function,

Â_t = Σ_{t'=t}^{T} (γλ)^{t'-t} δ_{t'},  where  δ_{t'} = r_{t'} + γ V(s_{t'+1}; ϕ) - V(s_{t'}; ϕ),

and the value functions V(·) are estimated by the value model. PPO then maximizes the empirical expectation of a so-called clipped surrogate objective term,

min( ν_t(θ) Â_t, clip(ν_t(θ), 1 - ε, 1 + ε) Â_t ),

where ν_t(θ) = p_K(k_t | q, k_<t; θ) / p_K(k_t | q, k_<t; θ_old) is the ratio between the current policy θ and a lagging policy θ_old. The lagging policy is updated to the current policy at a fixed interval of s training steps, and is kept fixed otherwise. We adapt this to our use case, and define the policy loss as

L_Policy(θ) = -E[ min( ν_t(θ) Â_t, clip(ν_t(θ), 1 - ε, 1 + ε) Â_t ) ],

where the expectation is taken over all instances in the training data (x ~ D_seen^train), the distribution of model-generated knowledge determined by the current policy conditioned on the instance's question (k ~ p_K(k | q; θ)), and all tokens in the knowledge statement (t ∈ [1, |k|]).
Value loss. The value model is trained with an MSE loss with respect to the target value V_t^targ, which in turn is estimated with a lagging value model ϕ_old:

L_Value(ϕ) = E[ ( V(s_t; ϕ) - V_t^targ )^2 ].
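The sketch below puts these pieces together for a single knowledge episode, assuming the per-token log-probabilities, value estimates, value targets, and shaped rewards have already been computed; γ, λ, ε, and α are assumed hyperparameters, and batching and the lagging-model update schedule are omitted relative to the actual implementation.

```python
import torch

def ppo_loss(logp, logp_old, values, value_targets, rewards,
             gamma=1.0, lam=0.95, eps=0.2, alpha=1.0):
    """Joint PPO loss for one episode of T tokens (a sketch, not the authors' code).

    logp, logp_old : per-token log p_K(k_t | q, k_<t) under the current / lagging policy
    values         : V(s_t; phi) from the value model
    value_targets  : V_t^targ, estimated with the lagging value model
    rewards        : per-step rewards (zero except the terminal, KL-penalized reward)
    """
    T = rewards.shape[0]
    # Truncated advantage estimate accumulated from TD residuals.
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    deltas = rewards + gamma * next_values - values
    adv, running = [], torch.zeros(())
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv.append(running)
    adv = torch.stack(adv[::-1]).detach()

    # Clipped surrogate policy objective (maximized, hence the leading minus sign).
    ratio = torch.exp(logp - logp_old.detach())
    policy_loss = -torch.minimum(ratio * adv,
                                 torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    # Value loss: MSE against the lagging-model targets.
    value_loss = ((values - value_targets.detach()) ** 2).mean()
    return policy_loss + alpha * value_loss
```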

Reward Shaping
We define the reward function in reinforcement learning as the quantified effect of RAINIER's knowledge on the QA model's prediction. Suppose we already have a reasonably good QA model, which assigns a probability score P_QA(a|q) to any candidate answer a ∈ A. Since we use a sequence-to-sequence language model (i.e. UnifiedQA (Khashabi et al., 2020)) as the QA model, we define

P_QA(a|q) = exp S_QA(a|q) / Σ_{a' ∈ A} exp S_QA(a'|q),  where  S_QA(a|q) = (1/|a|) Σ_{i=1}^{|a|} log p_QA(a_i | q, a_<i; ψ_QA),

and p_QA(a_i | q, a_<i; ψ_QA) is the language modeling score received by a_i, the i-th token of a. The naive prediction is the candidate answer that gets the highest P_QA(a|q) (or equivalently, the highest S_QA(a|q)): â = argmax_{a ∈ A} P_QA(a|q).
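A sketch of these scores, assuming UnifiedQA is loaded as a T5-style sequence-to-sequence model from a public checkpoint and that S_QA is the length-normalized answer log-likelihood as reconstructed above (both the checkpoint id and the normalization are assumptions):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

qa_tok = AutoTokenizer.from_pretrained("allenai/unifiedqa-t5-large")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-t5-large")

def answer_scores(prompt: str, candidates: list[str]) -> torch.Tensor:
    """Return S_QA(a | prompt) for each candidate answer a."""
    scores = []
    for answer in candidates:
        enc = qa_tok(prompt, return_tensors="pt")
        labels = qa_tok(answer, return_tensors="pt").input_ids
        out = qa_model(**enc, labels=labels)
        # out.loss is the mean negative log-likelihood per answer token,
        # so its negation is a length-normalized score.
        scores.append(-out.loss)
    return torch.stack(scores)

def answer_probs(prompt: str, candidates: list[str]) -> torch.Tensor:
    """P_QA(a | prompt): softmax of S_QA over the candidate answers."""
    return torch.softmax(answer_scores(prompt, candidates), dim=0)
```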
We aim at maximizing P_QA(a* | q • k), the probability score received by the correct answer when the QA model is prompted with the knowledge k generated by RAINIER, where • denotes text concatenation. One naive definition of the reward function is

r(x, k) = P_QA(a* | q • k) - P_QA(a* | q).

However, this reward only captures the absolute change in score, not whether the model's prediction changes. To remedy this, we define the reward function as

r(x, k) = 1/2 [ tanh( S_QA(a* | q • k) - max_{a' ∈ A, a' ≠ a*} S_QA(a' | q • k) ) - tanh( S_QA(a* | q) - max_{a' ∈ A, a' ≠ a*} S_QA(a' | q) ) ].

Intuitively, this function gives a reward near +1 if the naive prediction is incorrect (i.e. S_QA(a*|q) < max_{a' ∈ A, a' ≠ a*} S_QA(a'|q)) while the knowledge-prompted prediction is correct (i.e. S_QA(a* | q • k) > max_{a' ∈ A, a' ≠ a*} S_QA(a' | q • k)). Similarly, the reward is near -1 if the naive prediction is correct but the knowledge-prompted prediction is incorrect. The hyperbolic tangent serves as a smoothed sign function, and provides a soft interpolation between the two polarities of reward values by taking into account the margin of the correct answer.
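A small sketch of this shaped reward under the reconstruction above (the 1/2 scaling is an assumption inferred from the stated ±1 range):

```python
import torch

def shaped_reward(scores_naive: torch.Tensor,
                  scores_prompted: torch.Tensor,
                  gold_idx: int) -> torch.Tensor:
    """Change in the tanh-squashed margin of the correct answer when the QA model
    is prompted with vs. without the generated knowledge.

    scores_naive / scores_prompted hold S_QA(a | q) and S_QA(a | q • k) for every
    candidate a; gold_idx indexes the correct answer a*.
    """
    def margin(scores: torch.Tensor) -> torch.Tensor:
        others = torch.cat([scores[:gold_idx], scores[gold_idx + 1:]])
        return scores[gold_idx] - others.max()

    return 0.5 * (torch.tanh(margin(scores_prompted)) - torch.tanh(margin(scores_naive)))
```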
We also experiment with some alternative definitions of the reward function. See Table 4.
Reward normalization. To stabilize training, we apply an affine transformation to the rewards so that they are initially normalized. Before starting Stage II training, we use the imitation model to generate a knowledge statement for each training instance, and estimate the population mean and standard deviation of the rewards:

μ = E_{x ~ D_seen^train, k ~ p_K(k|q; θ_imit)} [ r(x, k) ],   σ = Std_{x, k} [ r(x, k) ].

In Stage II training, each reward is then normalized as:

r'(x, k) = ( r(x, k) - μ ) / σ.    (5)
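A minimal sketch of this normalization step; the function names are illustrative:

```python
import statistics
from typing import Callable, List

def fit_reward_normalizer(imitation_rewards: List[float]) -> Callable[[float], float]:
    """Estimate mu/sigma from rewards of imitation-model knowledge, then return the
    affine transform applied to every reward during Stage II training (Eqn. 5)."""
    mu = statistics.mean(imitation_rewards)
    sigma = statistics.pstdev(imitation_rewards)
    return lambda r: (r - mu) / sigma

# normalize = fit_reward_normalizer(rewards_from_imitation_model)
# r_prime = normalize(r)
```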

Inference: Knowledge Prompting and Aggregation
Following Liu et al. (2022), at inference time we use RAINIER to generate multiple knowledge statements per question, and prompt the QA model by individually concatenating each knowledge statement to the question. The knowledge statements are generated by RAINIER with nucleus sampling at p = 0.5 (Holtzman et al., 2020):

k_m ~ p_K(· | q; θ),  m = 1, ..., M,   and   k_0 = ε,

where M is the number of knowledge statements per question, and ε denotes the empty string (i.e. prompting without knowledge). We collect one set of QA model outputs for prompting with each knowledge statement. The final prediction is the candidate answer that receives the maximum confidence, and the prediction is supported by a single knowledge statement, the selected knowledge:

â = argmax_{a ∈ A} max_{0 ≤ m ≤ M} P_QA(a | q • k_m),   k̂ = k_{m̂}  where  m̂ = argmax_{0 ≤ m ≤ M} max_{a ∈ A} P_QA(a | q • k_m).

Training-time model selection. In Stage II training, we generate only one knowledge statement per question for the validation set. Predictions are made using the same knowledge prompting method as above, and the model checkpoint with the maximal accuracy on the union of all validation sets is selected.
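A hedged sketch of this prompting-and-aggregation procedure, with the answer-scoring function passed in (e.g. the answer_probs helper sketched earlier); the RAINIER checkpoint id and generation arguments below are placeholders rather than the released model:

```python
from typing import Callable, List, Tuple
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-large")                   # placeholder id
introspector = AutoModelForSeq2SeqLM.from_pretrained("t5-large")  # stands in for RAINIER

def predict(question: str, candidates: List[str],
            score_fn: Callable[[str, List[str]], torch.Tensor],
            M: int = 10) -> Tuple[str, str]:
    """Return (predicted answer, selected knowledge) by max-confidence aggregation."""
    enc = tok(question, return_tensors="pt")
    gen = introspector.generate(**enc, do_sample=True, top_p=0.5,
                                num_return_sequences=M, max_new_tokens=32)
    knowledge = [""] + [tok.decode(g, skip_special_tokens=True) for g in gen]
    best = (-float("inf"), "", "")
    for k in knowledge:
        # q • k = {q} \n {k}, as described in the experimental setup; the exact
        # separator rendering is an assumption.
        prompt = question if k == "" else f"{question} \n {k}"
        probs = score_fn(prompt, candidates)   # P_QA(a | q • k) over all candidates
        conf, idx = probs.max(dim=0)
        if conf.item() > best[0]:
            best = (conf.item(), candidates[idx.item()], k)
    return best[1], best[2]
```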
Models. For Stage I training, we get silver knowledge from the GPT-3-Curie (13B) model (Brown et al., 2020). The knowledge introspector is initialized with T5-large (Raffel et al., 2019), which has 0.77B parameters. For Stage II training, we initialize the value model with T5-large, and replace the language modeling head with a value regression head initialized from scratch; we use UnifiedQA-large (UQA-large) (Khashabi et al., 2020) as the QA model that provides the reward, and the text concatenation function is defined as q • k = {q} \n {k}. We use the same question formatting as UnifiedQA. See Table 7 for hyperparameters.
Baselines. We mainly report performance improvements over the vanilla QA baseline (i.e. direct inference with the UnifiedQA-large model, without prompting with RAINIER-generated knowledge). We also consider using knowledge from:

• Few-shot GPT-3 (Liu et al., 2022), where knowledge statements are elicited from the GPT-3-Curie (13B) model, under the same prompts used for getting silver knowledge in Stage I training (§2.1) and the same decoding hyperparameters (M = 10 knowledge statements per question, nucleus sampling with p = 0.5).

• Self-talk (Shwartz et al., 2020), where we generate M = 10 knowledge statements per question with GPT-3-Curie and a variety of templates.

• DREAM (Gu et al., 2022), where we generate M = 10 scene elaborations per question with the DREAM (11B) model.

See §A.2 for more details on these baselines. We do not compare with chain-of-thought prompting (Wei et al., 2022) because it relies on emergent behaviors that do not exist at the scale we experiment with.

Main Results
Performance on seen datasets. Table 1 shows the performance of the RAINIER-enhanced QA model on the seen datasets. On average, our method achieves more than a 5% improvement over directly applying the QA model. The knowledge generated by RAINIER improves performance on five benchmarks: CommonsenseQA, QASC, PhysicalIQA, SocialIQA, and Winogrande, with the greatest improvements on CommonsenseQA (+6%) and QASC (+12%). As shown in Table 8, there is no performance gain on OpenBookQA, ARC, and AI2Science. We conjecture that this is because the QA model, UnifiedQA, is already trained on these three datasets, which sets a strong baseline.
Comparison with other models. Compared to RAINIER, other knowledge generation models, including few-shot GPT-3, Self-talk, and DREAM, provide generally weaker improvements over the vanilla QA baseline. In particular, RAINIER outperforms the GPT-3-based models while being 16x smaller in parameter size (0.77B vs. 13B).
Performance on unseen datasets. Table 2 shows that RAINIER's knowledge substantially improves performance over the vanilla QA baseline on the four unseen datasets, demonstrating its generalization capability.
Choice of QA model for evaluation. To verify that RAINIER is not merely exploiting the reward provided by the QA model used during training, we evaluate the effect of RAINIER's knowledge on different QA models. We choose three other UnifiedQA models of different sizes, as well as a different model, Unicorn (Lourie et al., 2021). Results are shown in Figure 2. RAINIER consistently gives performance gains on top of all QA models, indicating that its knowledge is generally useful information rather than an artifact of model-specific reward hacking. We even observe performance gains with a QA model that is 4x as large as RAINIER, which means that generating and prompting relevant knowledge can be a technique complementary to model scaling, and can be done meaningfully with smaller models. Finally, we see the largest improvements when the QA model itself has weak, but non-trivial, performance (UnifiedQA-small for seen datasets, and UnifiedQA-base for unseen datasets).

Reward function. Table 4 shows the results for knowledge introspectors trained with different reward functions. Our reward shaping gives the best performance on unseen datasets, and among the best performance on seen datasets. While the naive prob diff reward function gives slightly better performance on seen datasets, our reward shaping results in better generalization.

Analysis
To get a deeper understanding of the behavior and capability of RAINIER, we manually analyzed the generated knowledge along several quality and diversity aspects. We asked three NLP experts to annotate the selected knowledge (§2.3) for up to 100 questions per dataset from the validation sets of 8 benchmarks (5 seen, 3 unseen; see Figure 3). Whether the knowledge rectifies or misleads the QA model's prediction was hidden from the annotators, eliminating this potential source of bias.
Quality. First, we follow Liu et al. (2022) in annotating the quality aspects of each knowledge statement with respect to the question: relevance, factuality, and helpfulness. We find that RAINIER-generated knowledge is overwhelmingly relevant to the respective questions. 64% of the statements are factually correct, 25% are factually incorrect, and the remaining 11% have undetermined factuality for various reasons (e.g. ambiguity, cultural sensitivity). 58% are seen by humans as helpful for reasoning about the question, whereas 24% are seen as harmful.
In our annotations, there are 420 knowledge statements that rectify UnifiedQA-large's predictions (i.e. flip them from wrong to right), and 246 that mislead the predictions (i.e. flip them from right to wrong). Among the rectifying knowledge, 84% is deemed helpful by humans; among the misleading knowledge, 62% is deemed harmful. These results follow similar trends as Liu et al. (2022), and show that RAINIER's knowledge is of high quality and interpretability in helping QA models.
Diversity. Additionally, we analyze diversity aspects by annotating each knowledge statement with the domain(s) it belongs to (e.g. scientific, social), the relation(s) it expresses (e.g. attribute, capable of), and its syntactic properties (e.g. negation, comparison). See Figure 3 for the complete list of options under each aspect. The domain distribution of the knowledge is strongly tied to the domain of the benchmark (e.g. scientific for QASC and QuaRTz, social for SocialIQA and Winogrande, numerical for NumerSense). The domain aspect is more diverse for benchmarks that test general commonsense, like CommonsenseQA and RiddleSense. For the relation aspect, many knowledge statements express an "attribute" relation, while other relations are also substantially represented. As for syntax, a good proportion of the knowledge contains structures like comparison and negation. Therefore, RAINIER's knowledge has good syntactic and semantic diversity while adapting to the domain.

Qualitative Examples
We show some examples of good knowledge generated by RAINIER in Table 5.

Related Work
Explicit reasoning for commonsense QA. Commonsense question answering poses a significant challenge to modern neural models. To improve performance and interpretability, many works have proposed explicit reasoning for tasks in this area, that is, verbalizing the intermediate text artifacts that facilitate the reasoning process. Rajani et al. (2019) and Latcinnik and Berant (2020) use supervised learning to train models to generate text explanations, while Gu et al. (2022) and Bansal et al. (2021) use similar training regimes to obtain models that can generate scene elaborations and paths through a structured knowledge graph, respectively. Shwartz et al. (2020) and Paranjape et al. (2021) prompt pretrained models with pre-defined templates to generate question clarifications or contrastive explanations, which are in turn used to prompt the inference model. The above approaches all pose, implicitly or explicitly, certain constraints (e.g. domain, relation, syntax) on the model-generated text. In contrast, Wei et al. (2022) elicit full chains of reasoning from language models with in-context learning; Liu et al. (2022) use few-shot demonstrations to elicit flexible, relevant knowledge statements from a language model, and Wang et al. (2022) distill this capability into smaller models using supervised learning. These methods provide more flexibility in the knowledge, yet they rely on access to very large language models (e.g. GPT-3). Aside from methods that make reasoning explicit in a linear chain, another line of work produces recursive structures of reasoning, through either backward chaining (Dalvi et al., 2022; Jung et al., 2022) or forward chaining (Bostrom et al., 2022). Our work contributes to this line of research, yet we depart from prior work by presenting the first approach that learns to generate relevant knowledge without requiring human-labeled gold knowledge.

Conclusion
We introduced RAINIER, a neural model that can introspect for relevant knowledge on a broad range of commonsense question answering tasks. RAINIER is trained with a novel adaptation of reinforcement learning, and does not need gold knowledge labels, which are difficult to obtain. Knowledge generated by RAINIER serves as useful prompts that improve the performance of QA models on both seen and unseen benchmarks, and it outperforms knowledge elicited from a few-shot GPT-3 model that is 16x bigger. RAINIER generates knowledge in the form of natural language statements that are fluent, meaningful, high-quality, and diverse in terms of domain and relation; furthermore, the effect of this knowledge on the QA model aligns well with human judgments.

Limitations
Despite the positive effect of our knowledge introspector RAINIER on commonsense QA tasks, its performance on non-commonsense applications is unknown and requires further investigation. Even for commonsense applications, there is still a large gap between model performance and human performance, so the resulting model is not ready for real-world applications. There is also a limit on the length of the knowledge it generates in our experimental setting, and it has not been tested on generating long and coherent text. Furthermore, in some cases it may generate knowledge that expresses inappropriate social values (Table 10), is culture-specific (Table 11), or contains ethical risks (Table 12). See §B for examples. Extra care should be taken when applying our model in production environments, especially when making critical decisions or exposing its generated contents directly to human end users.

Table 9: Examples of knowledge generated by RAINIER which are semantically problematic. Each entry lists the dataset, question, generated knowledge, the type of flaw, and a comment.

Knowledge: The fruit of the year is the date. | Flaw: Ambiguous | Comment: The knowledge does not specify which year, so it is not possible to verify its factuality.

QuaRTz | Question: Carla is pushing two carts down a street. One cart is heavy, the other is light. The heavy cart, compared to the light one, will accelerate (A) slowly (B) quickly | Knowledge: The heavy cart will accelerate slower than the light one. | Flaw: Under-specified | Comment: Missing the control factor that the forces exerted on the carts are the same. Though this flaw also exists in the question itself.

RS | Question: What has a head at night but not in the morning? (A) sleeping (B) pillow (C) neck (D) shampoo (E) beer | Knowledge: Sleeping animals have a head at night. | Flaw: Over-specified | Comment: It would be normal to say "animals have a head". It is weird to qualify with "sleeping" and "at night".

WG | Question: Because William developed a sore knee during his race against Kyle, _ won the race. (A) William (B) Kyle | Knowledge: William was not able to run the race. | Flaw: Over-confident | Comment: William might still be able to finish the race with a sore knee.

RS | Question: what is the begining of enternity the end of life the end of time and the end to every race (A) e (B) quick (C) finality (D) fix (E) habit | Knowledge: The end of every race is the end of every race. | Flaw: Tautological | Comment: This knowledge does not give any meaningful information.

QuaRTz | Question: Sharon is conducting an experiment on valence electrons and soon discovers that when they are closer to the nucleus, they are _____ easily removed from the atom. (A) more (B) less | Knowledge: Valence electrons are more prone to being removed from the atom. | Flaw: Not applicable | Comment: This knowledge implicitly compares the removability of valence vs. non-valence electrons. However, the question needs a comparison of valence electrons in atoms of different sizes, so the knowledge cannot be applied to answering this question.

Ablations

Stage I and Stage II training. We experimented with omitting Stage I (imitation) and/or Stage II (reinforcement) from the training pipeline. Results are shown in Table 3. Without Stage I training,

Figure 2: Effectiveness of RAINIER-generated knowledge on different QA models. Average accuracy on dev sets is reported. (Note: results of few-shot GPT-3-Curie on Unicorn-large are missing.)

Figure 3: Human analysis of RAINIER-generated knowledge. Left: percentage of good knowledge in each quality aspect. Mid: agreement between human and machine on the helpfulness of selected knowledge. Right: percentage of RAINIER-generated knowledge categorized by domain, expressed relation, and syntax. The percentages do not add up to 100% because some knowledge statements have none of these characteristics, while others may have multiple.

Table 2: Results on unseen datasets.

Table 3: Ablations on the importance of both training stages.

Table 4: Ablations on the choice of reward function.

Table 5: Examples of good knowledge generated by RAINIER. Each of these knowledge statements rectifies UnifiedQA-large's prediction, and is labeled by the annotator as relevant, factual, and helpful.
Table 7 (hyperparameters, fragment): Learning rate of the Adam optimizer (with a linear learning rate decay schedule).

Table 8: Results on the other 3 seen datasets. All experiments use UnifiedQA-large as the QA model, optionally with knowledge from one of the knowledge generation models. On these datasets, RAINIER-generated knowledge does not give an improvement over the vanilla QA baseline.

Table 10: Examples of knowledge generated by RAINIER that express some social value. Each entry lists the dataset, question, generated knowledge, the category, and a comment.

SIQA | Question: Riley broke loose from the house. He thought he would never get out of there. Why did Riley do this? (A) Stay in the house longer (B) think about his life (C) go home for a while | Knowledge: Breaking out of a bad habit is usually a bad idea. | Category: Social value | Comment: This knowledge is a generally true statement, so we labeled it as factual.

SIQA | Question: Tracy heard a faint buzzing noise and immediately ran for her life. How would you describe Tracy? (A) scared of bees (B) sad (C) not phased by bees | Knowledge: One should not be scared of bees. | Category: Social value | Comment: It is hard to decide whether this knowledge should be considered factual or not.

SIQA | Question: Remy gave Skylar's Netflix account password to one of Remy's other friends. How would Skylar feel as a result? (A) like a bad friend (B) excited (C) used | Knowledge: A friend can be used by a friend. | Category: Social value | Comment: It is ambiguous whether "can" means "it is possible that ..." or "ought to". If it is the latter, then the knowledge is promoting a problematic social value.

SIQA | Question: Riley was the best of friends with the boy with cancer. What will Riley want to do next? (A) visit the hospital (B) shun the friend (C) become friends with the boy with cancer too | Knowledge: One should visit their sick friend. | Category: Social value | Comment: It is generally a kind thing to visit a sick friend. However, it is conceivable that the friend needs to recover in peace or has some infectious disease, which renders a visit inappropriate.

SIQA | Question: Carson tried to fight Robin last night because Robin hurt Carson a lot. What will Carson want to do next? (A) apologize (B) do nothing (C) hurt Robin | Knowledge: One should apologize when they hurt someone.

SIQA | Question: Bailey told Alex to send the pdf because they didn't want to do it themselves. How would Alex feel as a result? (A) lazy about work (B) happy (C) angry | Knowledge: One should be willing to help others. | Category: Social value | Comment: This knowledge is generally accepted, but it is not a good fit to the question's context. It is normal to be emotional when being ordered to do something on another's behalf.

SIQA | Question: Kendall wrapped a bandage around my neck after getting injured in a fight. What will Kendall want to do next? (A) harm them (B) punish them (C) protect them | Knowledge: One should help others in need.