Elaboration-Generating Commonsense Question Answering at Scale

In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge that helps improve performance. Yet the cost of working with such models is very high; in this work, we finetune smaller language models to generate useful intermediate context, referred to here as elaborations. Our framework alternates between updating two language models—an elaboration generator and an answer predictor—allowing each to influence the other. Using less than 0.5% of the parameters of GPT-3, our model outperforms alternatives with similar sizes and closes the gap with GPT-3 on four commonsense question answering benchmarks. Human evaluations show that the quality of the generated elaborations is high.


Introduction
Commonsense question answering (QA; Talmor et al., 2019) provides benchmarks used to evaluate the extent to which NLP models, increasingly based on language models, can "understand" questions and reason about their answers. For example, consider the question in Figure 1: Gases released during the use of fossil fuels cause a what? A reasonably informed human could give the answer global warming, with reasoning such as: Fossil fuel emissions are the main source of greenhouse gases. They cause global warming. It is common to use LMs to predict answers directly for QA tasks (Devlin et al., 2019; Liu et al., 2019; Khashabi et al., 2020).
On challenging datasets whose questions rely on unstated background knowledge (Talmor et al., 2021), some recent works rely on external knowledge, e.g., Wikipedia or structured knowledge bases (Mihaylov and Frank, 2018; Lin et al., 2019), for additional information that helps to answer the question. Such attempts are limited by the availability and coverage of the knowledge sources. Another line of study (Liu et al., 2022; Paranjape et al., 2021; Shwartz et al., 2020) demonstrates that generating text that expresses additional background knowledge relevant to a question is beneficial for answer prediction. The ability to express such knowledge may also promote model explainability by explicitly showing the reasoning process. However, expressing high-quality knowledge relies on massive (and thus expensive) pretrained LMs, e.g., GPT-3 with 175B parameters (Brown et al., 2020).

In this work, we focus on a more practical setting and investigate the ability of smaller LMs, e.g., BART, which is around 400× smaller than GPT-3, to reason and infer in an end-to-end manner. To this end, we propose a scalable framework, ALternating Elaboration and Answer Producer (ALEAP), consisting of two interactive modules: an elaboration generator and an answer predictor. Instead of generating intermediate contexts (also known as elaborations) independently, we propose a probabilistic framework that treats the elaboration as a latent variable and iteratively optimizes the elaboration generator using feedback from answer prediction. Specifically, for each question-answer pair (q, a), we decompose the distribution of the answer conditioned on the question, P (a | q), into a distribution P (c | q) over a latent elaboration, modeled by the elaboration generator, and a likelihood P (a | c, q) of the answer, modeled by the answer predictor.
To optimize P (a | q), we alternately train the elaboration generator and the answer predictor so that each can benefit the other. Earlier work either pre-constructs elaborations c from external knowledge (Mihaylov and Frank, 2018) or learns P (c | q) solely based on annotations (Rajani et al., 2019); we learn the elaboration generator by distilling high-quality knowledge from the massive GPT-3 language model. To do this, we use a procedure inspired by hard Expectation-Maximization (Min et al., 2019). This involves refining and filtering elaborations informed by the answer predictor, as shown in Figure 1. ALEAP is thus capable of propagating information in both directions: from elaboration generator to answer predictor and vice versa.
We conduct experiments on four commonsense QA datasets: CommonsenseQA (Talmor et al., 2019), CommonsenseQA 2.0 (Talmor et al., 2021), scientific commonsense (QASC), and OpenBookQA (OBQA). The experimental results reveal that (1) using much smaller LMs (e.g., T5, BART, and GPT-2) finetuned for both elaboration generation and answer prediction narrows the gap between small models and GPT-3; (2) the ability to generate and reason over background elaborations brings larger performance gains than direct inference on the more challenging commonsense QA datasets; (3) the alternating framework helps to filter irrelevant elaborations generated from GPT-3, and the learned elaboration generator can express information that helps to answer the question, as shown through human evaluations.

Modeling Answers and Elaborations
We focus on the task of commonsense question answering in the multiple-choice setting. Given a commonsense question, we seek to identify the correct answer among the candidate choices. Importantly, we are not provided with any additional elaboration that may be needed to identify the answer. We formalize the setting and define the model in this section, and Section 3 details the training procedure.

Elaborations as A Latent Variable
We formalize commonsense QA in a probabilistic framework. Given a question q and its correct answer a, we seek to train a model that maximizes the probability of the correct answer, i.e., P (a | q). Directly predicting the answer can be challenging when complex understanding is involved. Moreover, doing so renders the provenance of the answer unclear. To address both issues, we assume that the answer depends on some latent elaboration c ∈ C, with C denoting a set of probable elaborations. With the addition of a latent variable, the training objective becomes

P (a | q) = Σ_{c∈C} P (c | q) P (a | c, q). (1)

Here, the first term in the summation, P (c | q), denotes the probability of an elaboration c conditioned on question q and is captured by the elaboration generator. The second term, P (a | c, q), characterizes the distribution of the answer a conditioned on both the elaboration and the question and is captured by the answer predictor.
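As a concrete illustration of Eq. 1, the marginalization over elaborations can be sketched with toy numbers (the elaborations and probabilities below are invented for illustration, not model outputs):

```python
# Toy sketch of Eq. 1: P(a | q) = sum over c of P(c | q) * P(a | c, q).
# All probabilities are made-up illustrative values.
p_c_given_q = {               # P(c | q), from the elaboration generator
    "emissions cause greenhouse gases": 0.6,
    "fossil fuels power cars": 0.4,
}
p_a_given_cq = {              # P(a = "global warming" | c, q), from the predictor
    "emissions cause greenhouse gases": 0.9,
    "fossil fuels power cars": 0.3,
}
p_a_given_q = sum(p * p_a_given_cq[c] for c, p in p_c_given_q.items())
print(round(p_a_given_q, 2))  # 0.66
```

A relevant elaboration thus raises the marginal probability of the correct answer even when the other sampled elaboration is only weakly informative.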

A Joint Model
The elaboration generator seeks to generate an elaboration sequence c given the question q as a prompt. We denote the conditional probability of an elaboration given a question by F_C; that is, using the notation from Eq. 1, we have P (c | q) = F_C (c, q; Φ). We model the elaboration generator using a generative language model that computes the distribution of tokens at each generation step:

F_C (c, q; Φ) = Π_{t=1}^{m} p (c_t | q, c_1, ..., c_{t−1}), (2)

where c = {c_1, ..., c_m} denotes the generated elaboration sequence. In our experiments, we adopt two generation models: BART (Lewis et al., 2020a) and GPT-2 (Radford et al., 2019).

The answer predictor, denoted F_A, aims to produce the probability of an answer sequence a given a question q and an elaboration c, i.e., P (a | c, q) = F_A (a, c, q; Θ). Any language model could be adopted as the answer predictor. For generality, we select two commonly used language models from two different paradigms, namely BERT (Devlin et al., 2019) as a masked language model and T5 (Raffel et al., 2020) as a generative language model. For T5, F_A (a, c, q; Θ) is computed for an answer sequence a = {a_1, ..., a_n} using

F_A (a, c, q; Θ) = Π_{t=1}^{n} p_T5 (a_t | c, q, a_1, ..., a_{t−1}), (3)

with p_T5 denoting the generation probability of token a_t under the T5 model. For BERT, F_A (a, c, q; Θ) is computed by applying a linear layer and a softmax layer on top of the hidden representation of the [CLS] token:

F_A (a, c, q; Θ) = softmax (W h_[CLS] + b), (4)

where h_[CLS] is obtained by feeding "[CLS] elaboration [SEP] question [SEP] answer [SEP]" into the BERT model.
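The chain-rule scoring used by the generative models above can be sketched in a few lines (the per-token probabilities are made-up stand-ins for model outputs):

```python
import math

# Score an answer sequence as the product of per-token conditional
# probabilities, accumulated in log space for numerical stability.
token_probs = [0.8, 0.9, 0.95]  # p(a_t | c, q, a_<t) for a 3-token answer
log_score = sum(math.log(p) for p in token_probs)
score = math.exp(log_score)     # equals 0.8 * 0.9 * 0.95
```

Working in log space matters in practice: for realistic sequence lengths, the raw product of probabilities underflows floating-point range.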

Inference
In the testing phase, for each question, we first use the trained elaboration generator F_C to sample a set of elaborations C̃. For each c̃ ∈ C̃, we use the answer predictor F_A to produce the distribution over each candidate answer a_i as P (a_i | c̃, q) = F_A (a_i, c̃, q; Θ). Running the answer predictor for each sampled elaboration, we take the maximum probability as the score for candidate a_i, which is then used to produce the final prediction:

â = argmax_{a_i∈A} max_{c̃∈C̃} F_A (a_i, c̃, q; Θ), (5)

with A denoting the set of candidate answers.
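This max-pooled inference rule (Eq. 5) can be sketched as follows; f_a is a stand-in for the trained answer predictor, and the score table is invented for illustration:

```python
# Inference sketch: score each candidate answer by its maximum predictor
# score over the sampled elaborations, then take the argmax over answers.
def predict(candidates, elaborations, f_a):
    scores = {a: max(f_a(a, c) for c in elaborations) for a in candidates}
    return max(scores, key=scores.get)

# Illustrative predictor scores F_A(a, c, q) for two answers, two elaborations.
table = {("global warming", "c1"): 0.9, ("global warming", "c2"): 0.4,
         ("rain", "c1"): 0.2, ("rain", "c2"): 0.5}
pred = predict(["global warming", "rain"], ["c1", "c2"],
               lambda a, c: table[(a, c)])
print(pred)  # global warming
```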

Alternating Elaboration and Answer Producer (ALEAP)
Existing retrieval- or knowledge-based QA methods only optimize P (a | c, q), assuming c is given and fixed. Explanation-based methods, on the other hand, train P (c | q) separately using human-annotated explanations. The separate training paradigm poses two problems: (1) an annotated explanation corpus is required, and (2) the elaboration generator cannot be calibrated towards the answer. In this work, we propose an approach that tackles both problems by jointly training the elaboration generator and the answer predictor in an alternating framework. The overall architecture for training is illustrated in Figure 2. At each iteration, the elaboration generator F_C learns to produce high-quality elaborations by receiving feedback from the answer predictor (Section 3.1). The answer predictor F_A then takes the generated elaborations as input to produce more reliable answers (Section 3.2). This strategy allows mutual interaction between the two components, propagating information in both directions. To reduce the search space of possible elaborations, we propose to distill knowledge from the pretrained GPT-3 model in a selective way to learn a (more lightweight) elaboration generator (Section 3.3).

[Figure 2: The training framework, which alternates between learning the elaboration generator (dotted arrows) and learning the answer predictor (solid arrows). The elaboration generator is optimized via an EM-like algorithm, with the E-step (red arrow) sampling and filtering high-quality elaborations and the M-step (blue arrow) maximizing the probability of C.]

An EM-Inspired Learner
Our goal is to optimize Eq. 1, rewritten below as a log-likelihood objective:

max_{Φ,Θ} log P (a | q) = max_{Φ,Θ} log Σ_{c∈C} F_C (c, q; Φ) F_A (a, c, q; Θ). (6)

Directly optimizing the elaboration generator in the expectation term here is difficult. Inspired by Qu et al. (2021), we adopt a hard EM framework to update the elaboration generator. Here the E-step first generates a set of elaborations related to the question and then selects "good" elaborations that help to predict the correct answer. The M-step maximizes the probability of generating these "good" elaborations.

E-Step. The E-step aims to identify a set of "good" elaborations from the posterior probability of an elaboration c after observing the correct answer a:

P (c | q, a) ∝ P (c | q) P (a | c, q). (7)

The posterior approximation on the RHS of Eq. 7 aligns with the intuition that an elaboration should have higher probability if it is both relevant to the question (i.e., P (c | q)) and, when combined with the question, provides a higher chance of predicting the correct answer (i.e., P (a | c, q)). However, it is non-trivial to sample from P (c | q) P (a | c, q), as the space of possible elaborations is intractable. To alleviate this issue, we propose two approximation strategies. First, we use GPT-3 to produce a more reliable distribution P (c | q), rewriting Eq. 7 as P (c | q, a) ∝ P_GPT-3 (c | q) P (a | c, q). Second, we approximate the sampling process via a two-step sample-and-filter procedure. Specifically, we first sample a set of elaborations C̃ from P_GPT-3 (c | q), as discussed in Section 3.3. In the second step, we filter C̃ by computing a simpler version of P (a | c̃, q) for each candidate elaboration c̃ ∈ C̃, normalizing only over the set C̃. We denote this simplified answer probability as P_0 (a | c̃, q), computed using the answer predictor F_A:

P_0 (a | c̃, q) = F_A (a, c̃, q; Θ) / Σ_{c̃′∈C̃} F_A (a, c̃′, q; Θ). (8)

Using Eq. 8, we select the top-K elaborations from C̃ to form C, the set of "good" elaborations. The benefit of this operation is that the answer predictor assists in learning how to select elaborations, which to our knowledge has not been explored in past work.
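The sample-and-filter step can be sketched as follows; the scores below are invented stand-ins for the answer predictor's outputs, and the candidate set would in practice come from GPT-3 sampling:

```python
# E-step sketch: normalize the answer predictor's scores over the sampled
# set (the simplified probability P_0 of Eq. 8), then keep the top-K.
def filter_topk(c_tilde, f_a_scores, k):
    total = sum(f_a_scores[c] for c in c_tilde)
    p0 = {c: f_a_scores[c] / total for c in c_tilde}  # normalize over the set
    return sorted(c_tilde, key=lambda c: p0[c], reverse=True)[:k]

scores = {"c1": 0.9, "c2": 0.1, "c3": 0.5, "c4": 0.3}
good = filter_topk(list(scores), scores, k=3)
print(good)  # ['c1', 'c3', 'c4']
```

Because the normalizer is shared across the set, the top-K under P_0 coincides with the top-K under the raw predictor scores; the normalization matters mainly when comparing filtering variants.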

M-Step. With the selected elaboration set C produced in the E-step, the M-step maximizes the probability of each elaboration c ∈ C to update the elaboration generator F_C while keeping the answer predictor fixed:

max_Φ Σ_{c∈C} log F_C (c, q; Φ). (9)

In this way, the elaboration generator learns to produce elaborations that are both relevant to the question and more likely to lead to the correct answer. Eq. 9 can also be viewed as a kind of selective distillation: instead of distilling all the sampled elaborations C̃ from GPT-3, it filters out noisy elaborations before transferring knowledge to the elaboration generator.

Optimizing Answer Predictor
After updating the elaboration generator, the next step of the alternating training updates the answer predictor F_A (a, c, q; Θ) while keeping the elaboration generator fixed. To achieve this, we approximate the objective of Eq. 6 by log P (a | c̃, q), sampling a set of elaborations c̃ ∈ C̃ from the elaboration generator P (c | q) = F_C (c, q; Φ). The objective then becomes maximizing, for the correct answer a,

log P (a | c̃, q) = log F_A (a, c̃, q; Θ). (10)

The sampled elaboration c̃ acts as additional background and explanation for the question, which helps to learn a more reliable answer prediction model. Alternating between updating the answer predictor and the elaboration generator promotes mutual enhancement of the two components. The entire training procedure of ALEAP can be found in Algorithm 1.

[Footnote 2: We also implement other filtering options based on the answer predictor, e.g., selecting only a subset of C̃ where F_A (a, c̃, q; Θ) > F_A (a−, c̃, q; Θ), with a− denoting an incorrect answer and c̃ ∈ C̃. Another option is to replace F_A (a, c̃, q) in Eq. 8 with F_A (a, c̃, q) − (1 / |A−|) Σ_{a−∈A−} F_A (a−, c̃, q), with A− denoting the set of incorrect answer choices. As shown in Section 4.4, Eq. 8 achieves the best performance.]
Algorithm 1 Training procedure of ALEAP.

8: M-Step: Update the elaboration generator F_C using Eq. 9 with C and q.
9: end for
10: B. Optimize the answer predictor F_A to produce P (a | c, q) (Section 3.2)
11: for each question-answer pair (q, a) in the batch do
12: Sample a set of candidate elaborations C̃ using the F_C trained in the previous step.
13: For each c̃ ∈ C̃, update the answer predictor F_A by maximizing Eq. 10 given a and c̃.
14: end for
15: end for
16: end for

Distilling GPT-3

As discussed in the E-step, we use GPT-3 to sample possible elaborations in order to train our elaboration generator. Liu et al. (2022) showed that, using a small number of prompts and a question, GPT-3 can generate useful knowledge to enhance answer prediction. Inspired by Hinton et al. (2015) and West et al. (2021), we adopt the idea of knowledge distillation to transfer knowledge from GPT-3 (which would be expensive to deploy at inference time) to our (cheaper) elaboration generator. Specifically, we first use GPT-3 to generate a set of elaborations given some predefined prompts. Following Liu et al. (2022), for each task, we design the prompt as a short instruction followed by five demonstrative examples and a new-question placeholder. By plugging each question into the placeholder, we can repeatedly sample an elaboration c̃ as a continuation of the prompt. This yields a set of candidate elaborations, C̃.
Here we use nucleus sampling (Holtzman et al., 2020) to sample each elaboration c̃. For knowledge distillation, a naive strategy is to optimize the elaboration generator by minimizing the negative log-likelihood of the GPT-3 samples,

− Σ_{c̃∈C̃} log P_s (c̃ | q),

with P_s denoting the student network, i.e., our elaboration generator. However, as shown in the experiments, GPT-3 is prone to generating noisy text sequences that may not be relevant to answering the question, which would lead to negative transfer. Our E-step proposal, on the other hand, can be viewed as a form of selective knowledge distillation, which filters the elaborations generated by GPT-3 according to the answer score before optimizing the student model.
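Putting the pieces together, the alternating procedure can be sketched as below; the Stub class and the sample_from_gpt3 / filter_topk arguments are toy placeholders (the real system wraps GPT-2/BART as F_C and T5/BERT as F_A):

```python
# Skeleton of the alternating training procedure (cf. Algorithm 1).
class Stub:
    """Toy component that only records how often it is updated."""
    def __init__(self):
        self.updates = 0
    def maximize_likelihood(self, *args):
        self.updates += 1          # stand-in for a gradient step
    def sample(self, q):
        return ["c1", "c2"]        # stand-in elaborations

def train_aleap(batches, f_c, f_a, sample_from_gpt3, filter_topk, k=3):
    for batch in batches:
        # A. Update the elaboration generator, answer predictor fixed.
        for q, a in batch:
            c_tilde = sample_from_gpt3(q)               # E-step: sample
            good = filter_topk(c_tilde, q, a, f_a)[:k]  # E-step: filter via F_A
            for c in good:
                f_c.maximize_likelihood(c, q)           # M-step (Eq. 9)
        # B. Update the answer predictor, elaboration generator fixed.
        for q, a in batch:
            for c in f_c.sample(q):
                f_a.maximize_likelihood(a, c, q)        # Eq. 10

f_c, f_a = Stub(), Stub()
train_aleap([[("q", "a")]], f_c, f_a,
            sample_from_gpt3=lambda q: ["c1", "c2", "c3", "c4"],
            filter_topk=lambda cs, q, a, f: cs)
print(f_c.updates, f_a.updates)  # 3 2
```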

Experiments
In this section, we examine the question: does jointly optimizing the elaboration generator with the answer predictor outperform approaches that merely retrieve knowledge from fixed, already-trained models? As a secondary objective, we also investigate the impact of the design choices in our approach, including the choice of language model, the need for distillation, the elaboration filtering criterion, and the decoding strategy.

Data and Setup
To evaluate our proposed model, we select four multiple-choice commonsense QA datasets: (1) CommonsenseQA (CSQA; Talmor et al., 2019) is created from commonsense knowledge about various concepts in ConceptNet. Most of the questions require implicit background knowledge that is trivial to humans. The dataset consists of 12,247 examples, each of which is a 5-way multiple-choice problem. (2) CommonsenseQA 2.0 (CSQA2; Talmor et al., 2021) is a more challenging dataset collected in an adversarial manner, where users are encouraged to create questions for which a well-trained RoBERTa model (Liu et al., 2019) fails to provide the correct answer. The dataset contains a total of 14,343 questions with binary answer choices (yes/no). (3) QASC is a question answering dataset requiring the composition of multiple pieces of text. It is collected from elementary- and middle-school science questions. The dataset contains 9,980 questions, each followed by 8 choices. Note that we do not use the gold-annotated background facts accompanying the original data, in order to test the model's ability to automatically elicit knowledge and reason. (4) OpenBookQA (OBQA) is a collection of open-book exams on elementary-level science facts. It contains a total of 5,957 questions with four candidate choices each. As with QASC, we remove the gold-annotated science facts in the original release.
We use GPT-3 (Brown et al., 2020) with nucleus sampling (p = 0.5; Holtzman et al., 2020) to sample 20 elaborations for each question. During alternating training, in each iteration we use 100 instances to update the elaboration generator followed by the answer predictor. The elaboration generator is implemented using GPT2-large (Radford et al., 2019) or BART-large (Lewis et al., 2020a). The answer predictor is implemented using T5-large (Raffel et al., 2020) or bert-base-uncased (Devlin et al., 2019). We adopt the Adam optimizer with the learning rate initialized at 10−5 for both components. The elaboration generator produces |C̃| = 10 elaborations during both training and testing via nucleus sampling (p = 0.95) with temperature 0.7. We set K = 3 when forming the top-K elaboration set C during the E-step.

Baselines
We organize the baselines into four groups: (1) Direct answer prediction without additional knowledge (direct). (2) Answer prediction with retrieved knowledge: COMET uses COMET (Bosselut et al., 2019) trained on the ATOMIC corpus to automatically generate causes and effects of a question, as a form of commonsense knowledge base completion; Wikipedia follows Chen et al. (2017), retrieving and ranking text spans in Wikipedia articles based on bigram hashing and TF-IDF matching. (3) Answer prediction with elaborations generated by fixed language models: selftalk generates extra background knowledge from clarification questions asked about the commonsense question using zero-shot language models (Shwartz et al., 2020); GPT-3 uses GPT-3 (Brown et al., 2020) to sample 10 knowledge spans as continuations of the question using demonstrative prompts. (4) Trained elaboration generator: scratch implements alternating training without distilling knowledge from GPT-3; pretrain first pretrains the elaboration generator by treating all the sequences generated by GPT-3 as ground truth, then finetunes the answer predictor on elaborations generated by the pretrained elaboration generator (i.e., with no influence from the answer predictor). For fair comparison, all four groups require training the answer predictor F_A. The second and third groups additionally involve intermediate contexts, which are kept fixed. The last group learns both an elaboration generator and an answer predictor. During inference, we pick the choice with the maximum score across all knowledge sequences or generations, following Eq. 5.

[Table 1: Accuracy results of models with or without external knowledge sources and elaboration generators. Here we use GPT2-large as the elaboration generator.]

Results
The experimental results are shown in Table 1.
Here we use T5-large as the answer predictor for CSQA, CSQA2, and QASC, and BERT for OBQA, chosen according to the best observed performance. As revealed in Table 1, the advantage of additional knowledge or elaborations is more evident for CSQA2, QASC, and OBQA than for CSQA (which contains relatively simpler questions). This verifies the importance of reasoning for complex QA problems. On the other hand, GPT-3 demonstrates significant performance gains over the other knowledge sources. Using less than 5% of the parameters of GPT-3, ALEAP outperforms GPT-3 on two datasets. Without access to an external knowledge resource at inference time, ALEAP achieves the best performance across all the datasets. It also clearly outperforms models with similar computational cost (e.g., scratch, pretrain). The performance gain of ALEAP over pretrain demonstrates the advantage of our alternating framework. When training from scratch, the elaboration generator is prone to learning meaningless shortcuts, e.g., "The correct answer: I know I'm not sure but whatever", as observed in the actual generations.

Analysis
In subsequent experiments, we use the development set of each corpus for evaluation.

Elaboration Generator. The elaboration generator is learned by finetuning a generative language model. Here we investigate the effects of different LMs, specifically BART-large and GPT2-large, as shown in Table 2. Both generators demonstrate consistent results across the training strategies (scratch, pretrain, ALEAP). In addition, GPT2-large slightly outperforms BART-large across all the experiments. The higher performance of GPT2-large could be credited to its larger parameter count (774M) compared to BART-large (406M). Another observation is that GPT2-large has more generation flexibility: its sequences generated with nucleus sampling are less repetitive and thus cover more aspects relevant to the question, whereas BART-large is restricted to repetitive patterns. For illustration, we pick a question from CSQA2: Cotton candy is sometimes made out of cotton? BART-large generates only a single distinct elaboration: "Cotton candy is a candy made from cotton.", whereas GPT2-large samples a variety of continuations, e.g., "The name 'cotton candy' refers to a product made from the soft, viscous material produced by certain plants.", "The most common type of cotton candy is known as cotton candy fluff, which is a mixture of corn syrup and cotton.", "Cane sugar is a form of artificial sweetener, usually produced by grinding raw cane sugar to a pulp."

Elaboration Filtering. As discussed in Section 3.1, the filtering process using Eq. 8 in the E-step could be replaced with other options. We implement three alternative criteria to investigate the effect of different filtering options. As shown in the first block (Elaboration filtering) of Table 3, the random option filters GPT-3-generated elaborations by randomly sampling 3 out of 20. The correct option selects all the elaborations that produce the correct answer when fed into the answer predictor.
The pos-neg option computes the score difference between the correct answer and the average of the incorrect answers, and selects the 3 elaborations with the highest scores. The pos option corresponds to Eq. 8, adopted by ALEAP. Clearly, random selection produces the worst results among all the options, verifying the benefit of filtering high-quality elaborations for training the elaboration generator.
Elaboration Integration. The second block (Elaboration integration) of Table 3 investigates the effect of different elaboration integration methods during inference. Recall from Eq. 5 that ALEAP uses maximum pooling over all the generated elaborations C̃ for final predictions. We are interested in how different inference strategies may affect the final performance. Specifically, instead of maximum pooling, we concatenate all the elaborations in C̃ into a single sequence and feed it to the answer predictor (concatenate). This brings a significant performance drop on CSQA and QASC, probably due to the added noise and the forgetting issue for long sequences. Another strategy formalizes inference probabilistically, where each generated elaboration contributes to the final prediction via weighted aggregation (probability). To produce the weights, we apply a softmax layer on top of the generation model's output logit for each generated elaboration c̃ ∈ C̃. The last option computes the similarity between each elaboration and the question and uses the most similar elaboration for final inference (similarity). Here we use sentence embeddings from sentence transformers (Reimers and Gurevych, 2019) with cosine similarity to select the optimal elaboration. Overall, maximum pooling yields the best results in most cases.
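The contrast between maximum pooling and probability-weighted aggregation can be sketched with toy scores (all numbers invented for illustration):

```python
import math

# Two integration strategies over three sampled elaborations.
answer_scores = [0.9, 0.4, 0.3]  # predictor scores F_A(a, c, q), one per c
gen_logits = [1.0, 2.0, 0.5]     # generator output logits for the same c

max_pool = max(answer_scores)    # maximum pooling (used by ALEAP)

exps = [math.exp(z) for z in gen_logits]
weights = [e / sum(exps) for e in exps]        # softmax over generator logits
weighted = sum(w * s for w, s in zip(weights, answer_scores))
```

Weighted aggregation can dilute a single highly informative elaboration (here, weighted ≈ 0.50 versus a max-pooled 0.9) when the generator assigns it low probability, which is consistent with maximum pooling performing best.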
Decoding Strategy. We are interested in how the decoding strategies of the LMs may affect the final performance. In the last block (Elaboration generation) of Table 3, we compare greedy decoding (greedy), where each decoding step selects the token with the highest probability; beam search (beam) with beam size 10, selecting the top 10 sequences; and nucleus sampling (sample), as implemented in the proposed model ALEAP. As shown in Table 3, decoding via sampling produces the best results or comes very close.
Sensitivity Test. We further investigate the effects of changing (1) the number of filtered high-quality elaborations (K) from GPT-3 and (2) the size of the set C̃, i.e., the total number of elaborations generated by the elaboration generator. The results are shown in Figure 3. The top plot shows that performance increases as K grows from 1 to 3, but decreases for K > 3. This pattern verifies that GPT-3 may generate elaborations that negatively affect the final performance. On the other hand, increasing the number of elaborations sampled from the elaboration generator (from 2 to 20) during both training and testing brings gradual improvements in accuracy. This is expected, given that a diverse set of sampled elaborations provides wider coverage of knowledge relevant to the question.

Human Evaluation
To evaluate the quality of elaborations for question answering, we conduct two sets of human evaluations on QASC and CSQA2. For the first experiment, we investigate whether the filtered elaborations from GPT-3 are considered more helpful to answer the question compared to those that are not selected by humans. For the second experiment, we evaluate the quality of the elaborations generated from the elaboration generator. The annotation task was carried out in Amazon Mechanical Turk. We restrict annotators to those located in English-speaking countries and who have at least 99% approval rate over more than 1000 tasks. The results are aggregated using majority vote among annotations from 3 workers. Our institution's IRB approved the study. We paid workers an estimated US$15 per hour.
Effect of Filtering. Recall that we use the answer predictor to filter elaborations generated from GPT-3 via Eq. 8 in the E-step. To examine whether the filtering process is capable of removing noisy elaborations, we randomly sample 100 questions from the training corpus of each of two datasets (QASC, CSQA2). For each instance, we present the crowd workers with a question, the correct answer, the elaboration (denoted SELECT) that has the highest score according to Eq. 8, and an elaboration (denoted DISCARD) randomly sampled from the remaining ones discarded by the answer predictor. Here the elaborations are those pre-generated from GPT-3. The workers are asked to evaluate the SELECT and DISCARD elaborations by choosing one of three labels: helpful (the elaboration adds useful information to answer the question), neutral (the elaboration has no influence on the problem), and harmful (the elaboration is misleading). To avoid annotation bias, we randomize the order of SELECT and DISCARD elaborations for each example. The results are shown in Figure 4. The number of helpful elaborations annotated by the workers is considerably higher for the selected category than for the discarded category. In contrast, the workers agree that the selected elaborations are less likely to be neutral or harmful compared to those that are discarded. The difference is even more evident on CSQA2. This verifies the necessity of using the answer predictor to filter noisy elaborations generated by GPT-3 before distilling the knowledge.
Elaboration Quality. In another experiment, we compare the quality of the elaboration generators from the pretraining setup, GPT-3 and our proposed model ALEAP. We select only one elaboration generated from each model that gives the highest score of the predicted answer during inference, which is actually adopted to produce the final prediction.
Adapting the metrics provided by Shwartz et al. (2020) and Liu et al. (2022), given a piece of automatically generated text, we evaluate three aspects: (1) Factuality: whether the text is entirely correct (factual), partially correct (partial), or entirely incorrect (incorrect); (2) Relevance: whether the text is relevant or irrelevant to the topics discussed in the question; (3) Helpfulness: whether the text provides useful information that helps answer the question (helpful), has no effect (neutral), or is misleading (harmful). For the first two aspects, we present the workers with only a question and the selected elaboration from one of the models. For the third aspect, we additionally provide the correct answer. The human evaluation results on 100 randomly sampled test examples from CSQA2 are shown in Figure 5. Clearly, ALEAP achieves better scores across all three aspects, with the most evident improvement in helpfulness. This result is especially encouraging considering that helpfulness is the aspect most directly tied to answering a question correctly.

We additionally conduct another experiment to evaluate the effect of our proposed elaboration generator on human predictions. Specifically, we randomly sample 100 test examples from QASC. For each example, we present the workers with the question and ask them to choose one answer from the multiple choices. In another round, we provide both the question and the generated elaboration to the workers and collect their answers. The two rounds recruit non-overlapping annotators to ensure validity. As a result, 78 questions are correctly answered by workers without seeing extra elaborations, while 81 questions are correctly answered when elaborations are provided.
This reflects that our elaboration generator is still beneficial to humans even though commonsense QA appears to be much easier for humans than machines.
Based on the annotations given by crowdsourced workers, we collect only those instances containing an elaboration generated by our model that is labeled as helpful by the workers. This results in 70 and 76 instances from the development sets of QASC and CSQA2, respectively. We then compare the performance of ALEAP under three settings: (1) No Elaboration presents only the question to the model during inference; (2) Random Elaboration additionally provides a generated elaboration randomly selected after removing the one labeled as helpful; (3) Helpful Elaboration provides the single elaboration selected and labeled as helpful by workers. The results are shown in Table 4. As expected, our model with helpful elaborations outperforms the other two settings by a large margin, aligning with our intuition that the addition of meaningful elaborations is beneficial to the task.

Related Work
Direct Inference. Given only natural-language commonsense questions, a straightforward solution is to use language models directly, either finetuned on gold-annotated answers (Sakaguchi et al., 2021; Talmor et al., 2019; Khashabi et al., 2020; Talmor et al., 2021) or in an unsupervised setting (Trinh and Le, 2018; Petroni et al., 2019; Puri and Catanzaro, 2019), exploiting knowledge already encoded in the pretrained parameters. However, beyond the performance score, it is unclear how these models reach the final answer and whether they perform correct reasoning. Direct inference without additional knowledge is also challenging for complex queries.
Inference with External Knowledge. External knowledge sources such as knowledge bases or Wikipedia contain rich information that can assist inference. Knowledge bases, e.g., ConceptNet (Speer et al., 2017) and ATOMIC, contain relational knowledge that can be incorporated as additional input for commonsense QA (Chang et al., 2020; Bian et al., 2021; Ma et al., 2021; Lv et al., 2020; Yasunaga et al., 2021). Large corpora are another knowledge source from which question-related facts can be retrieved (Lin et al., 2017; Tandon et al., 2018; Joshi et al., 2020; Lewis et al., 2020b). These knowledge-based approaches depend on the availability and coverage of the knowledge source, which in turn usually depends on the problem domain.
Inference with Generation. To alleviate the dependence on external knowledge, recent work advocates automatically generating additional question-related knowledge via language models. One direction learns a generator that produces meaningful justifications for question answering from human-authored explanations (Camburu et al., 2018; Rajani et al., 2019; Latcinnik and Berant, 2020). Other work adopted a pretrained commonsense generation model (Bosselut et al., 2019) to generate implications of the questions.
These approaches, however, require gold-annotated commonsense facts to train a good generator. Another direction explores zero-shot generation using pretrained language models. Shwartz et al. (2020) introduced Self-talk, which elicits question clarifications using a few pre-defined templates. Paranjape et al. (2021) further proposed contrastive prompts that compare candidate options to choose the correct answer. Liu et al. (2022) generated additional text as continuations of each question by feeding demonstrative prompts to GPT-3. Different from existing approaches, we seek to learn an effective generation model jointly with the answer predictor, allowing for mutual enhancement.

Conclusion
We propose an alternating framework for commonsense QA that alternates between learning a relatively lightweight elaboration generator and predicting an answer from the question and the automatically generated elaboration. The two modules are trained interactively, propagating signals to each other. Our approach narrows the performance gap between small LMs and GPT-3: the elaboration generator produces elaborations judged useful by humans and matches the performance of the much more expensive GPT-3 as an elaboration generator.
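The probabilistic decomposition underlying this framework can be summarized as follows; the parameter subscripts ($\theta$ for the elaboration generator, $\phi$ for the answer predictor) are our notation for exposition:

$$P(a \mid q) \;=\; \sum_{c} P_\theta(c \mid q)\, P_\phi(a \mid c, q),$$

where $c$ is the latent elaboration. Training alternates between updating $\phi$ to maximize the answer likelihood given elaborations sampled from $P_\theta(c \mid q)$, and updating $\theta$ using the answer predictor's feedback on which elaborations raise $P_\phi(a \mid c, q)$.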