Comprehension Based Question Answering using Bloom’s Taxonomy

Current pre-trained language models have acquired vast amounts of knowledge, but a more limited ability to use that knowledge. Bloom’s Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide proximal context that helps the model answer questions by being relevant to those questions. We show that targeting context in this manner improves performance across four popular commonsense question answering datasets.


Introduction
Recent large language models such as GPT-3 (Brown et al., 2020) have made a giant leap forward in knowledge acquisition and even generalize this knowledge to new tasks. But when less narrow tasks are considered, they fail to understand as much as these benchmarks suggest. They turn out to be "stochastic parrots" (Bender et al., 2021) or "smart/super parrots" (Dunietz et al., 2020) that just memorize without all of the comprehension we want from a Natural Language Understanding system. We focus on a particular kind of failure mode where the model knows (has memorized) the information it needs but is not able to apply that information correctly, and we do so in a zero-shot fashion to control for what the model knows.
For example, in Fig. 1 the model is asked if a mixture of grape juice and cranberry juice is safe to drink (Marcus and Davis, 2020). GPT-3 declares that it is a deadly poison, even though it appears to "know" that grape juice and cranberry juice are safe to drink by themselves (Fig. 1, Level 1, dark purple). It even knows that cranberry juice with grape juice is not poisonous, but it still thinks the result is death (Fig. 1, Level 2, light blue). The model has memorized the necessary information from large amounts of text, but does not use its knowledge appropriately. Following Shwartz et al. (2020), we extract this knowledge as explicit language and then feed it back as additional context during inference, forcing the model to use what it already knows, but in our case targeting specifically useful knowledge.

* These two authors contributed equally.
To formalize this distinction we drew inspiration from elementary school classrooms, where teachers (Miller, 2002; Harvey and Goudvis, 2007) have a schema based approach in which they teach children to demonstrate multiple levels of comprehension, from direct recall from memory to complex inferences. They use a hierarchy of comprehension skills called Bloom's Taxonomy (Anderson et al., 2000) (cf. Fig. 1), with memorization at the bottom (requiring children to recall facts), followed by understanding (requiring children to grasp semantics), application (requiring children to solve problems), and more complex skills. For us, these comprehension skills describe ways our language model might fail to use its knowledge.
In this paper we address our failure mode by relying on commonly understood relationships between the skills of Bloom's Taxonomy, which we term proximal context. In order to understand whether the cranberry grape mixture is poisonous, the model needs to remember whether grape juice is poisonous. In order to apply its knowledge to figure out what will happen next, it needs to understand whether the cranberry grape mixture is poisonous or not. In general, the proximal context for a particular task T at level L is given by those tasks implicitly required by T, which are mostly at level L − 1 of the taxonomy. We guide our language model to answer questions more accurately by providing it not just any context, but proximal context. In performing zero-shot question answering, our language model asks itself additional clarification questions.
Our contributions in this paper are:
• We use Bloom's Taxonomy to choose proximal clarifying context that improves question answering performance using only what the model already knows.
• We show proximal context is better than other levels of context on four different commonsense question answering tasks.
• By observing how different levels of clarification impact our language model we also explain how the model answers questions.

Related Works
Question Answering from External Supervision. Several approaches have been proposed to improve question answering by adding external knowledge sources. Recent large pre-trained language models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Joshi et al., 2020; Clark et al., 2020) learn general purpose text encoders from a huge text corpus. Petroni et al. (2019) recently used a language model as a knowledge base to unmask a token given an entity and a relation in a predefined template. Shwartz et al. (2020) and Bosselut et al. (2019a,b) used pre-trained language models to improve zero-shot question answering performance by extracting context from the language model itself, using self-talk or a knowledge graph. We add context via self-talk, with structure provided by Bloom's Taxonomy.
Bloom's Taxonomy. The original work (Bloom, 1956) defined taxonomies for learning in the cognitive (intellectual), affective (interests, attitudes, values), and psychomotor domains, though the cognitive domain is what we usually refer to today. Almost half a century later the cognitive domain taxonomy was revised (Anderson et al., 2000) to reflect more active thinking and improve usability by adding verbs to describe levels. Teachers use this taxonomy, for example in computer science education (Whalley et al., 2006; Thompson et al., 2008; Oliver et al., 2004), and our inspiration is from this revised version of the cognitive taxonomy. Machine learning has been applied to automatically classify questions (Mohammed and Omar, 2020; Zhang et al., 2021; Nafa et al., 2016) into Bloom's Taxonomy levels, but the taxonomy has not been applied to analyze or improve machine learning models themselves. We use it to help our model think about what it knows.

Approach
Our approach builds on the zero-shot question answering approach of Shwartz et al. (2020), answering questions (Section 3.1) by adding clarifications with self-talk (Section 3.2). We describe this approach and then use Bloom's Taxonomy to select better clarifications (Section 3.2).

Question Answering with Language Models
Given a prompt p, a question q, and answer options a_o ∀o ∈ [1, K], we use a pre-trained language model LM to pick the correct answer a_{o*}. The language model assigns each option a score s_o by evaluating the text that concatenates the prompt, question, and that answer option. The language model's answer is just the answer with the highest score: ô = argmax_o s_o.

[Excerpt of Tab. 1: clarification question prefixes (with their Bloom's Taxonomy levels), sample generated clarification questions, and sample generated answers.]

"What is the main purpose of" (level 2). Questions: (a) What is the main purpose of this investigation? (b) What is the main purpose of this post? Answers: (a) The purpose of this investigation is to provide information about how and why he was shot. (b) The purpose of this post is to share my thoughts and feelings on his death.

CommonsenseQA

"What is the main function of a" (level 2). Questions: (a) What is the main function of a teacher in this area? (b) What is the main function of a farmer? Answers: (a) The main function of a teacher in this area is to teach them about life and love. (b) The main function of a farmer is to provide food for his family and the community.

"What might have caused" (level 3). Questions: (a) What might have caused this problem? (b) What might have caused the animal to flee? Answers: (a) The cause of this problem was that his wife's husband didn't have enough money. (b) The cause of the animal to flee was a predator.

Social IQA
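The answer-selection rule can be sketched in a few lines. Here toy_lm_score is a crude hypothetical stand-in for the language model's scoring function (a real implementation would sum token log-probabilities under the model); it is an illustration, not the actual system:

```python
def pick_answer(prompt, question, answers, lm_score):
    """Score each answer option under the language model and return
    the index of the highest-scoring one: o-hat = argmax_o s_o."""
    scores = [lm_score(f"{prompt} {question} {ans}") for ans in answers]
    return max(range(len(answers)), key=scores.__getitem__)

def toy_lm_score(text):
    """Hypothetical stand-in for LM scoring: counts repeated words,
    so answers sharing vocabulary with the question score higher."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return len(words) - len(set(words))
```

For example, an answer that repeats the question's words ("Grape juice is not poisonous.") outscores an unrelated one under this toy scorer.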

Self-talk Clarifications
Self-talk (Shwartz et al., 2020) has a language model ask itself clarifying questions and then answer those questions to generate clarifications. Stage 1: Ask clarification questions. To produce clarifications we start with a set of clarification question prefixes r_1, ..., r_J that are designed specifically for each question answering dataset. "What happens if" is a sample prefix for the clarifications, shown in Fig. 1, and in Tab. 1 we present examples for all the datasets we use. In this stage the language model completes each of these prefixes, using its generator function LM_G to ask one question R_j = LM_G(r_j) per prefix.
Stage 2: Answer the questions. Next we use the model to answer each of these questions, possibly prompted with an answer prefix b_j corresponding to question prefix r_j. The results are the clarifications, one per question. This can improve question answering performance on its own, but in the next section we more carefully choose clarifications using our notion of proximal context and Bloom's Taxonomy.
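The two stages can be sketched as a small pipeline. Here generate is a hypothetical stand-in for the model's sampling function LM_G; the sketch shows only the control flow, not the authors' implementation:

```python
def self_talk(question_prefixes, answer_prefixes, generate):
    """Two-stage self-talk: complete each question prefix r_j into a
    clarification question R_j, then answer it prompted by b_j."""
    clarifications = []
    for r_j, b_j in zip(question_prefixes, answer_prefixes):
        R_j = generate(r_j)               # Stage 1: R_j = LM_G(r_j)
        A_j = generate(R_j + " " + b_j)   # Stage 2: answer, prompted by b_j
        clarifications.append((R_j, A_j))
    return clarifications
```

With a real model, generate would sample a continuation of its argument; any deterministic text function can be substituted to trace the pipeline.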

Using Bloom's Taxonomy to Choose Clarifications with Proximal Context
To test our idea of proximal context we consider the level L of the task given by each dataset, then allow only proximal clarifications of level L − 1. We label each question prefix with the level of Bloom's Taxonomy that it falls into, and then force the model to choose from the set C_L of clarifications of level L. This results in a final choice for each level: o*_L = argmax_o max_{j ∈ C_L} LM(T_{j,o}). We also provide a Choice Baseline that allows the model to choose any level of clarification, to show the model would have difficulty choosing proximal clarifications itself. Note that annotating questions along Bloom's Taxonomy requires special skills typically found only among educators. While a layperson can be trained to annotate such questions, our experience was that it takes much more time than we could afford for a preliminary study such as this one. We therefore relied on our co-author, Sara Rutherford-Quach, a researcher at SRI's Education Division who has also worked as a teacher at the kindergarten-elementary level, to provide the annotations. Two other co-authors, Sahu and Cogswell, went through those annotations and made sure that each label had a three-way consensus among Rutherford-Quach, Sahu, and Cogswell. There might be some ambiguity about which level a particular prefix fits into, but this is also true of other applications of the taxonomy (Thompson et al., 2008). In future work, we plan to carry out a more rigorous annotation with more than one skilled annotator so we can measure inter-annotator agreement through measures such as Kappa scores.
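The per-level selection rule o*_L = argmax_o max_{j ∈ C_L} LM(T_{j,o}) can be sketched as follows, where lm_score is an illustrative placeholder for the language model's score of the text combining a clarification with an answer option:

```python
def choose_with_level(options, clarifications_by_level, level, lm_score):
    """Pick the answer option whose best-scoring clarification of the
    given Bloom's Taxonomy level scores highest:
    o*_L = argmax_o max_{j in C_L} LM(T_{j,o})."""
    c_L = clarifications_by_level[level]   # restrict to clarifications of level L
    def best_score(option):
        # max over j in C_L of the score of the combined text T_{j,o}
        return max(lm_score(clar + " " + option) for clar in c_L)
    return max(range(len(options)), key=lambda o: best_score(options[o]))
```

Restricting c_L to level L − 1 of the dataset's task level yields the proximal-context condition; passing all levels together yields the Choice Baseline.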

Datasets
We evaluate our study on four datasets that can each be thought of in terms of multiple choice question answering, all measuring some kind of common sense: COPA (Roemmele et al., 2011) measures common sense causal reasoning, CommonsenseQA (Talmor et al., 2019) asks questions that require prior knowledge, Social IQA (Sap et al., 2019) asks about social common sense, and WinoGrande (Sakaguchi et al., 2020) adversarially measures semantic common sense. Perhaps surprisingly, all of the datasets we used ask questions that fall into just one level of the taxonomy (Tab. 2). These datasets do focus on very specific problems, but the result is still disappointing because it would be more useful to see variation in both task and clarification level. It may be interesting to develop datasets that can better express the range of abilities described by Bloom's Taxonomy.

Language Model
We use distil-GPT2 (Sanh et al., 2019) and the publicly released GPT-Neo 2.7B (Black et al., 2021) (based on EleutherAI's replication of the GPT-3 architecture) as the language models throughout our experiments. Our clarification question prefixes and hyperparameter settings for both models follow Shwartz et al. (2020). For each question prefix, we generate 5 clarification questions using nucleus sampling threshold probability p = 0.2 and adding at most 6 words to the clarification question prefix. We then generate 10 answers to each clarification question using p = 0.5 and maximum answer length 10. Some changes were necessary to accurately measure the impact of clarification level. Instead of always including no clarification as a choice, we do not allow this option, as it defeats our goal of measuring clarification level impact. Furthermore, we do not use the clarification questions which were manually completed without input from the model (as in COPA and Winogrande).
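Nucleus sampling with threshold p keeps only the smallest set of most-probable tokens whose cumulative probability reaches p, then renormalizes before sampling. A minimal NumPy sketch of this filtering step (an illustration, not the authors' implementation):

```python
import numpy as np

def nucleus_filter(probs, p):
    """Zero out tokens outside the top-p nucleus and renormalize.
    The nucleus is the smallest set of highest-probability tokens
    whose cumulative probability is at least p."""
    order = np.argsort(probs)[::-1]        # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # size of the nucleus
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

With a low threshold such as p = 0.2 (used for clarification questions), only the most probable tokens survive, yielding conservative generations; p = 0.5 (used for answers) admits more diversity.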
In order to compare performance across different levels of clarification we only consider examples where the model was able to generate at least one clarification from each level. To increase the number of viable examples we found it necessary to remove some restrictions relative to the implementation of Shwartz et al. (2020). In particular, we kept all clarifications, even those that had no overlapping words with the context, and did not allow the model to choose the "no clarification" option. Even with these constraints it was still often the case that distil-GPT2 could not generate a short clarification sentence that was plausible enough to use, whereas GPT-Neo was able to generate clarifications for almost the entire dataset. This indicates larger scale models may be better able to take advantage of clarifying questions. The number of examples with valid clarifications for all levels is indicated for each model in column 2 of Tab. 2. These changes help us more accurately measure the impact of Bloom's Taxonomy, but mean our approach is not directly comparable to Shwartz et al. (2020).

Results

Table 2 reports the performance of our Bloom's Taxonomy infused zero-shot question answering method. Each row shows question answering accuracy for a particular dataset and level of clarification. If our hypothesis is correct then the level of available clarifications should matter, and clarifications that provide proximal context (one level below the dataset level) should be most helpful.
Clarification Level Makes a Difference. All levels of clarification questions and answers provide some amount of extra information that changes how a language model processes the entire string it is presented with. This is often helpful information, but it may be that all levels of Bloom's Taxonomy provide equally useful information. We find that is not the case. Different levels of clarification help more or less, as evidenced by the large gap between minimum and maximum accuracy for each dataset. Furthermore, when the model can choose any clarification (rows 0A/B/C/D) it either does a worse job than proximal context or its performance is similar to proximal context, so enforcing a particular kind of context should be helpful.
Proximal Context Helps Most. Proximal context, as we've defined it with respect to Bloom's Taxonomy, is context from the clarification level directly below the dataset question level. The proximal clarification level for each dataset is marked by a * in Tab. 2. In all cases proximal clarifications are better than clarifications of a lower level. For the datasets that ask level 3 questions, the proximal (level 2) clarifications also outperform level 1 clarifications (2B/C/D greater than 1B/C/D). Proximal clarifications are also about as good as or better than clarifications of a higher level. This can be seen for Winogrande by noting that row 1A is greater than 2A, and for the other datasets by noting that rows 2B/C/D usually show greater performance than 3B/C/D. Overall, proximal context is the most consistently effective.

Qualitative Results
In Tab. 1 we show samples of question answer pairs generated for each model and in Tab. 5 of the appendix we show complete examples (with context and choices) for each model and dataset. GPT-Neo is much larger than distil-GPT2 and is expected to generalize to slightly new tasks like the clarification generation task better than the smaller model. This expectation is clearly met by the observed quality of clarifications. Distil-GPT2 clarification questions and answers often do not have meaningful semantics, are not correct, or are not relevant. GPT-Neo is much more likely to generate questions and answers which are meaningful, correct, and relevant. This suggests the greater number of valid clarifications generated by GPT-Neo may be due to an increase in clarification quality. Furthermore, it fails in an intuitive fashion: when it fails to generate meaningful answers it often has also failed to generate a meaningful clarification question in the first place.
Also note that the performance differences observed for distil-GPT2 occur despite its relatively poor interpretability. This indicates that context which is somewhat relevant to the topic even if it does not precisely make sense can still be useful.

Conclusion
Large pre-trained language models sometimes have the right information, but they just do not know how to use it. We used Bloom's Taxonomy to pick questions with the right amount of proximal context. This helped the language models use their knowledge to more effectively answer questions. In the future we would like to extend our work to tasks that present a wide range of questions falling under different levels of the taxonomy. Similarly, we would also like to study and improve upon the current limited set of prefix questions.

A Prefixes and Examples
In the appendix we provide more details about the question prefixes we used in Tab. 3 and more examples of outputs from our models in Tab. 5.

[Residue of Tab. 3, partially recovered: for Winogrande the clarification question prefixes are "What is the definition of" (level 1), "What is the main purpose of" (level 2), "What is the main function of a" (level 2), "What are the properties of a" (level 1), "What is" (level 1), and "What does it mean to" (level 2), with corresponding answer prefixes "The definition of ... is", "The purpose of ... is to", "The main function of a ... is", "The properties of a ... are", "... is", and "... means". The level columns for the remaining datasets were not recoverable.]

Table 4: Example contexts, questions, choices, clarification questions and clarification answers for each dataset. We present results for both Distil-GPT2 and GPT-Neo.