Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models

Commonsense reasoning benchmarks have been largely solved by fine-tuning language models. The downside is that fine-tuning may cause models to overfit to task-specific data and thereby forget knowledge gained during pre-training. Recent works propose only lightweight model updates, as models may already possess useful knowledge from pre-training, but a challenge remains in understanding which parts of the model should be refined for a given task, and to what extent. In this paper, we investigate what models learn from commonsense reasoning datasets. We measure the impact of three different adaptation methods on the generalization and accuracy of models. Our experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers. We observe that alternative adaptation methods like prefix-tuning achieve comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.


Introduction
Machine commonsense reasoning has recently gained new traction, largely due to a collection of diverse benchmarks (Talmor et al., 2019; Bhagavatula et al., 2019; Sap et al., 2019) and the successful application of language modeling methods on these benchmarks (Shwartz et al., 2020; Bauer and Bansal, 2021). The most widely adopted approach to solving these commonsense reasoning tasks is fine-tuning large pre-trained language models (LMs) (Devlin et al., 2019; Liu et al., 2019) on the task-specific training data. Meanwhile, it has been shown that language models acquire certain commonsense background knowledge during their pre-training on large textual data (Petroni et al., 2019; Davison et al., 2019; Ma et al., 2021). In light of these findings and the large capacity of these language models, recent work has proposed lightweight alternatives to fine-tuning LMs, e.g., by updating only a small number of additional parameters (Lin et al., 2020b; Li and Liang, 2021), or by updating the inputs while keeping the model weights intact (Jiang et al., 2020; Shin et al., 2020). Intuitively, these lightweight methods may largely retain the model's pre-trained knowledge and elicit the knowledge suitable for the target task, provided that much of this knowledge is already encoded in the model parameters. However, to our knowledge, no comprehensive comparison exists between these model-updating strategies.
In this paper, we pose the question: what do models learn from commonsense reasoning datasets? We consider three representative learning methods: regular fine-tuning, model extension with prefix-tuning (Li and Liang, 2021), and model prompting with Autoprompt (Shin et al., 2020). We apply them to two representative model classes: the autoregressive language model GPT-2 (Radford et al., 2019) and the sequence-to-sequence language model BART (Lewis et al., 2020). We conduct a thorough evaluation on the generative evaluation benchmarks ProtoQA (Boratko et al., 2020) and CommonGen (Lin et al., 2020a), training on different partitions of the training data. Our experiments show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers. Prompting methods have lower accuracy, but tend to show higher robustness to "adversarial" splits. Extending the models by prefix-tuning represents a "sweet spot" between task accuracy, generalization, and robustness.

Related Work
Prior works probe the commonsense knowledge learned by LMs. Davison et al. (2019) mined commonsense knowledge from LMs, using templates with masked tokens; Richardson and Sabharwal (2020) designed diagnostic tasks to probe LMs' knowledge of definitions and taxonomic reasoning. The LAMA probes (Petroni et al., 2019) demonstrate that LMs can largely recover knowledge in existing (commonsense) knowledge graphs: they could thus be queried/prompted directly as knowledge bases (Shwartz et al., 2020; Shin et al., 2020). Ettinger (2020) diagnoses the BERT model, finding that it struggles with complex inference, role-based event prediction, and grasping the contextual impacts of negation. The logical commonsense probes in RICA (Zhou et al., 2020) show that LMs perform similarly to random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to linguistic perturbations. Elazar et al. (2021) posit that while LMs can learn to perform well on commonsense tasks, their commonsense reasoning ability mostly comes from fine-tuning on the task data. Some works have sought to uncover what models learn through training on question answering datasets, exposing various dataset artifacts in the process (Jia and Liang, 2017; Kaushik and Lipton, 2018; Pugaliya et al., 2019). Welbl et al. (2020) found that models trained on the SQuAD2.0 dataset (Rajpurkar et al., 2018) are insensitive to meaningful changes in the question and predict the same answer. Ko et al. (2020) found that BERT easily picks up the position bias in the SQuAD dataset (Rajpurkar et al., 2016), and models' performance can drop by more than 50 points of F1 score when training on a biased subset. Sen and Saffari (2020) analyzed the model's ability to generalize by training on five different QA datasets, and found that no single dataset is robust to perturbations in the questions. Shah et al. (2020) tested models trained on several multiple-choice QA datasets and showed that they largely rely on dataset biases. Previous work mostly studies language models as-is, or evaluates models fine-tuned on QA datasets. In this paper, we go a step further: we investigate models adapted to a target task using three different methods, and study the effect of each method on the model's learning process.

Task and datasets
We experiment with generative commonsense tasks, assuming that they are closer to real-world deployment of LMs and that they provide more insight into models' reasoning abilities. Specifically, we evaluate our models on the recently-introduced ProtoQA (Boratko et al., 2020) and CommonGen (Lin et al., 2020a) datasets. For ProtoQA, given a question about a prototypical situation, the model is expected to produce a ranked list of answers. Each question in the dev and test sets is annotated with 100 answers, which are further manually grouped into clusters: the model's outputs are compared with the answer clusters, and the scores reflect the sizes of the matched clusters. We adopt ProtoQA's official evaluation metrics: Max Answers@k (the percentage of correct answers within the top-k predictions) and Max Incorrect@k (the percentage of correct answers obtained before making k mistakes). We compute the answer matches based on WordNet similarity, as recommended by Boratko et al. (2020). For CommonGen, given a set of 3-5 input concepts, the task is to generate a scene description utilizing all input concepts. Following prior work, we adopt BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016) as evaluation metrics.
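To make the cluster-based scoring concrete, the following is a minimal sketch of a Max Answers@k-style score. It uses exact string matching in place of the official WordNet-similarity matching, and the cluster representation (a list of answer-set/count pairs) is our own simplification of the ProtoQA format.

```python
def max_answers_at_k(predictions, clusters, k):
    """Score the top-k predictions against gold answer clusters.

    clusters: list of (answer_strings, count) pairs, where `count` is the
    number of annotators who gave an answer in that cluster. Each cluster
    is credited at most once; the score is the matched cluster mass over
    the total answer count. (Illustrative sketch: the official ProtoQA
    evaluator matches answers by WordNet similarity, not exact strings.)
    """
    total = sum(count for _, count in clusters)
    matched = [False] * len(clusters)
    for pred in predictions[:k]:
        for i, (answers, _) in enumerate(clusters):
            if not matched[i] and pred in answers:
                matched[i] = True
                break
    return sum(count for (_, count), hit in zip(clusters, matched) if hit) / total
```

For example, with clusters {dog/puppy: 60, cat: 40}, the predictions ["dog", "fish", "cat"] score 0.6 at k=2 and 1.0 at k=3.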

Strategies
We describe how we adapt pre-trained GPT-2 and BART models to a target task with three methods: (S1) Fine-tuning is the classic model adaptation approach, where all parameters are updated using the training signal from the ground truth. (S2) Prefix-tuning (Li and Liang, 2021) fixes the pre-trained model's parameters during adaptation. This method adds trainable parameters, called prefix states, to the self-attention component (Vaswani et al., 2017) of every transformer layer in the model; only these prefix states are updated during training. Essentially, the prefix states act as conditioning variables that contextualize the representation of the inputs, such that the model can generate the desired outputs. (S3) Instead of updating model parameters, Autoprompt (Shin et al., 2020) appends a few trigger tokens to the input and updates these trigger tokens during training. Specifically, the gradient with respect to the trigger tokens is computed using the ground-truth data. During training, new trigger tokens are discovered along the direction of the gradient, replacing the existing ones so as to minimize the loss. Essentially, this method automatically learns to paraphrase the input question so that the model can generate the desired outputs. We select fine-tuning, prefix-tuning, and Autoprompt as they are representative methods for adapting a pre-trained model to a target task, namely: 1. model adaptation (fine-tuning); 2. model extension (prefix-tuning); and 3. input adaptation (Autoprompt). We examine whether methods with different degrees of adaptation lead the model to learn different behaviors. Training details can be found in appendix A.1.
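The effect of prefix states on self-attention can be sketched as follows: trainable key/value vectors are prepended to the layer's own keys and values, so every query also attends over the prefix. This is a toy single-head illustration in plain Python, not the actual implementation; the function and variable names are ours.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Single-head scaled dot-product attention where trainable prefix
    key/value states are prepended to the input's own keys and values,
    as in prefix-tuning. Only the prefix vectors would receive gradient
    updates; the rest of the model stays frozen."""
    keys = prefix_keys + keys          # queries attend over prefix + input
    values = prefix_values + values
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Because the frozen attention weights now mix prefix values into every position's output, the learned prefix can steer generation without touching any pre-trained parameter.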

Research Questions
We address three questions in this paper, namely: (RQ1: Adaptation level) How do different levels of adaptation affect the model's task-specific performance? We expect that methods that adapt a larger number of parameters to the training task (fine-tuning) would perform better on the task itself, as the larger search space makes it more likely to find a task optimum. We investigate this by comparing S1-S3 on the two benchmarks.
(RQ2: Task structure) Do models only learn the task structure during training? As we are working with relatively small benchmarks, we hypothesize that LMs acquire most of the necessary commonsense knowledge during pre-training instead of at adaptation time, during which they instead learn to elicit this knowledge. In this case, such adaptation to task structure could be done on just a subset of the training data without a large drop in performance, and the model need not depend on any lexical similarities between the training set and the dev set. To this end, we train our models with each adaptation method on: 1) a non-overlap subset of ProtoQA, consisting of train-set QA pairs whose answers do not have any vocabulary overlap with the dev set answers; and 2) a min-overlap split for CommonGen, selecting training instances whose input concepts appear at most once in the dev set.
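The non-overlap filtering for ProtoQA can be sketched roughly as follows; the naive whitespace tokenization and the function name are our assumptions, and the paper's actual preprocessing may differ.

```python
def non_overlap_subset(train_pairs, dev_answers):
    """Keep only training QA pairs whose answer words never appear in any
    dev-set answer (the 'non-overlap' split described above).

    train_pairs: list of (question, list_of_answer_strings) tuples.
    dev_answers: flat list of dev-set answer strings.
    """
    dev_vocab = {w for ans in dev_answers for w in ans.lower().split()}
    kept = []
    for question, answers in train_pairs:
        answer_vocab = {w for ans in answers for w in ans.lower().split()}
        if not (answer_vocab & dev_vocab):  # no shared answer vocabulary
            kept.append((question, answers))
    return kept
```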
(RQ3: Novelty) Do models simply memorize the training data, or do they learn to reason about novel questions and answers as well? To test whether models merely retrieve lexically similar examples, we formulate a similarity subset of ProtoQA, comprising the 100 training-set questions with the highest cosine similarity to each dev-set question. To test whether models achieve better performance due to improved reasoning ability, we selected 30 questions from the ProtoQA dev set for which the model answers are at least partially correct, and we minimally changed each question through manual annotation, so that the required reasoning process is the same but the answer set is different (example question pairs are given in appendix A.2). Then, we use the BART model trained with each adaptation method to generate answers for the 30 new questions. We manually validate the 30 new questions and the 30 original questions, in order to compute the accuracy of the models as well as the percentage of overlapping answers between the original and new questions. More details are provided in section 4.1.1, and a summary of all dataset splits used in our experiments is shown in Table 2.

ProtoQA
In response to RQ1 (adaptation level), Table 1 shows that prefix-tuning yields similar or slightly worse results compared to fine-tuning, for both LM classes, indicating that prefix-tuning is a promising lightweight alternative to fine-tuning. Autoprompt lags behind the tuning methods, while outperforming the zero-shot baseline. This is not surprising, as Autoprompt performs a fairly limited adaptation by only updating trigger tokens in a discrete space.
The results for RQ2 (task structure) are shown in Figure 1. Fine-tuning a model on the non-overlap data leads to a drastic drop in performance, compared to using the full training data. Prefix-tuning's drop in performance is smaller than that of fine-tuning, while Autoprompt achieves the best performance when training on this subset. Autoprompt's result is similar to training on the full data, showing that it is much more robust to an adversarial training split and mainly learns how to elicit the model's pre-trained knowledge to answer the questions. Fine-tuning learns knowledge together with the task structure, while prefix-tuning stands between fine-tuning and Autoprompt. Since prefix-tuning does not change the pre-trained model's parameters, but rather adds new ones, it learns to mix the knowledge gained from pre-training with the signal from training instances to answer new questions.
For RQ3 (novelty), the results of training on the similarity subset are shown in Figure 1. Although the number of QA pairs is much lower, fine-tuning achieves the same results as in the full-data setting. This shows that fine-tuning benefits more from the content of the training data than from the task format, further informing our findings for RQ2. Prefix-tuning performs slightly worse than in the full-data setting, indicating that here it is largely learning the training content. Autoprompt achieves similar results as in the full-data and non-overlap settings, confirming our RQ2 observation that it mainly learns the task format. We note that, while retrieving lexically similar questions might yield partial results, this form of pattern matching is insufficient for commonsense reasoning. For example, for the training question Name a vegetable that people like to steam, the model learned the answer cauliflower, which is coincidentally also a correct answer to the dev question Name a vegetable that is as large as your head. In other words, the model answers correctly for the wrong reasons.

Manual Annotation
We conduct manual annotation to further verify our observations for RQ3. For the BART model trained with each adaptation method, we generate the top-10 answers for every question and then annotate each answer independently, on a 5-point Likert scale: 1 means strongly disagree, 2 mostly disagree, 3 not sure/it depends, 4 mostly agree, and 5 strongly agree. In total, 4 researchers annotated 1,165 QA pairs, where each QA pair received 3 ratings. The overall Krippendorff's alpha (Krippendorff, 2004) is 0.52, indicating moderate agreement. If we merge answer choices 1 and 2 into 'incorrect' and 4 and 5 into 'correct', and then compute the 3-class categorical agreement using Fleiss' kappa, the score is 0.36, indicating fair agreement. We then consolidate the 3 ratings by taking their average, and consider an answer to be correct if the average score is greater than 3.5. The results from the manual assessment of the models' reasoning capabilities are shown in Table 3. We observe that our LMs are not able to capture subtle changes in the question that lead to a different answer set; models perform worse on the new questions overall. We believe this is because the newly-generated questions are more difficult to answer, as they seldom appear in any text corpus. We also see a high overlap between the answers generated for the original and the newly-created questions, especially for fine-tuning and prefix-tuning, where nearly half (44.7%) of the answers are repeated. This confirms our observation that models memorize/retrieve training-set answers without actually engaging in reasoning.
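The consolidation step described above amounts to the following small sketch (the function name is ours):

```python
def consolidate(ratings_per_answer):
    """Average the three Likert ratings for each answer and mark the
    answer correct when the mean exceeds 3.5, as in the annotation
    protocol above. Input: a list of per-answer rating lists."""
    results = []
    for ratings in ratings_per_answer:
        mean = sum(ratings) / len(ratings)
        results.append(mean > 3.5)
    return results
```

For instance, ratings (5, 4, 4) average to about 4.33 and count as correct, while (1, 2, 3) average to 2.0 and count as incorrect.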

CommonGen
The full results on the CommonGen dataset are shown in Table 4. Overall, the results follow a similar trend to those of ProtoQA, as prefix-tuning performs significantly better than fine-tuning when trained on an "adversarial" split. We notice that the relative drop in performance for both methods on the min-overlap subset is less drastic than on ProtoQA. We attribute this mainly to the task format. For ProtoQA, models need to perform one or a few hops of reasoning to answer the questions, and there is no direct evidence in the question itself, i.e., the model cannot directly copy answers from the question. For CommonGen, however, the model is directly given the target concepts as inputs, which it can reuse directly in its outputs. Thus, we argue that the amount of reasoning required by CommonGen is more restricted than in ProtoQA, and models can rely on direct clues from the input to solve the task. It is also worth noting that the accuracy of Autoprompt is extremely low on all 3 splits. In fact, Autoprompt fails to generate any meaningful sentences after training, and the SPICE metric could not be computed. We again attribute this to the task format. Autoprompt eventually discovers trigger tokens that are meaningless to humans; we can think of them as injecting task-specific noise into the pre-trained models. For ProtoQA, the model is expected to generate single-word or short-phrase answers to complete the sentence (i.e., the converted question), so it is reasonable for the model to do so even with the injected noise. However, for CommonGen, the model is expected to generate a full sentence as output; with Autoprompt, the task essentially becomes generating a sentence given input concepts and a set of random tokens, which is very different from BART's pre-training context.

Conclusions
Experiments with two language model classes, on two generative commonsense benchmarks, under three adaptation methods, revealed that the learning efficiency of LMs relies heavily on the adaptation method. Fine-tuning teaches the model both the structure and the content of the task; prompting approaches focus on learning the task structure only; model extension by prefix-tuning falls between these two extremes. Consequently, prompting is the least sensitive of the three methods to the training data size and quality, and prefix-tuning can generalize better to novel concepts regardless of the task format. Future work on generalizable commonsense reasoning should leverage these findings, and: 1) avoid fine-tuning, as we may never be able to create datasets without any unintended biases (Linzen, 2020); and 2) evaluate on multiple independent test sets to better replicate real-world settings, as training on any single split of data can lead to an overestimation of performance (Søgaard et al., 2021).

A.1.1 Training Details

For all of the experiments on CommonGen, we used a learning rate of 1e-5, batch size 16, 500 warm-up steps, and Adam epsilon 1e-6. We trained the model for 2 epochs and, similarly, we train models for more epochs on the min-overlap and random subsets. During inference, we use beam search with beam size 5, length penalty 0.6, and repetition penalty 2.0. Note that we disabled positional embeddings in the BART encoder for all CommonGen experiments, as we found them detrimental to model performance.

A.1.2 Model Implementation
We used the BART-large and GPT2-large models provided by the transformers library (Wolf et al., 2019). For prefix-tuning, we used a prefix of length 10 and a 1-layer prefix MLP with hidden size 512 (we tried {512, 800} and found them to give very close results). The learning rate is 5e-5, while the other hyperparameters are the same as in fine-tuning (we tried {1e-5, 2e-5, 5e-5, 8e-5} and found the latter two to achieve slightly better results). For prefix-tuning with the BART model, we added prefix states to the self-attention in encoder layers, and to the self-attention and cross-attention in decoder layers. For the GPT-2 model, we only add prefix states to the self-attention in decoder layers. For Autoprompt with BART, we used the same 10 trigger tokens for both encoder and decoder; the trigger tokens are all initialized with mask tokens. For GPT-2, we also used 10 trigger tokens. Since the model does not have mask tokens, we initialized the triggers with the tokenized prompt "Based on simple commonsense fact, we know that", which is exactly 10 tokens under BPE. We train the models with batch size 32 and gradient accumulation steps 4 (we tried batch sizes {32, 128, 256} and found that larger batch sizes yield more stable results). At each update step, we search for the next trigger token among the 100 closest candidate tokens along the gradient direction (we used 10 candidate tokens for the CommonGen experiments, as we found that both 10 and 100 lead to extremely bad results, so we used 10 to save computation time). A summary of the number of trainable parameters for each model-adaptation combination is shown in Table 5.

A.1.3 Dataset splits
The ProtoQA dataset provides a dev-scraped set and a dev-crowdsourced set: dev-scraped is collected from a Family Feud fan website, i.e., the same source as the training set, while dev-crowdsourced contains new questions and answers written by crowd-workers, i.e., the same source as the test set. We select the best model using the loss on the dev-scraped set and report results on the dev-crowdsourced set, because the test-set answers are hidden and we need the ground-truth answers to test our hypotheses. In the main paper, all references to the ProtoQA dev set refer to the dev-crowdsourced set. For the CommonGen dataset, we select the best models using the loss on the dev set.
For the similarity subset of ProtoQA, we adopt the stsb-roberta-large model from the sentence-transformers (Reimers and Gurevych, 2019) library and compute the cosine similarity between the train and dev questions.
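The selection can be sketched as below; `embed` is a placeholder for the stsb-roberta-large sentence encoder (any callable returning a vector works), and the helper names are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_similar(dev_question, train_questions, embed, k=100):
    """Rank training questions by cosine similarity to a dev question
    and return the k most similar ones. `embed` stands in for the
    sentence-transformers encoder used in the paper."""
    dev_vec = embed(dev_question)
    scored = sorted(train_questions,
                    key=lambda q: cosine(embed(q), dev_vec),
                    reverse=True)
    return scored[:k]
```

Running this per dev-set question, with k=100, and pooling the results yields the similarity subset.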

A.2 New Questions
Examples of the original and newly written questions, along with the model predictions, are shown in Table 6.

Original: Name a sport that requires a lot of equipment.
  Fine-tune: hockey, golf, football, tennis, basketball, baseball, soccer, boxing, wrestling, volleyball
  Prefix-tuning: football, basketball, hockey, soccer, tennis, golf, baseball, wrestling, volleyball, skiing
  Autoprompt: hockey, soccer, golf, basketball, football, tennis, baseball, rugby, volleyball, ice hockey
New: Name a sport that you don't need a lot of equipment for.
  Fine-tune: hockey, tennis, baseball, golf, soccer, football, basketball, bowling, volleyball, swimming
  Prefix-tuning: basketball, tennis, soccer, golf, football, hockey, baseball, bowling, swimming, skiing
  Autoprompt: basketball, soccer, hockey, golf, volleyball, tennis, football, baseball, rugby, lacrosse

Original: Name something around the house that's often replaced.
  Fine-tune: tv, television, furniture, dishes, carpet, toilet paper, refrigerator, windows, stereo, lights
  Prefix-tuning: carpet, lamp, light, furniture, tv, clothes, dishes, television, bedding, toilet paper
  Autoprompt: TV, tv, couch, table, toilet, bed, television, microwave, chair, lamp
New: Name something around the house that's hardly ever replaced.
  Fine-tune: tv, television, furniture, dishes, refrigerator, stereo, carpet, toilet paper, windows, appliances
  Prefix-tuning: dishes, furniture, lamp, carpet, tv, clothes, bedding, TV, light, television
  Autoprompt: TV, tv, fridge, microwave, couch, refrigerator, dishwasher, coffee table, bed, table

Original: Name a job where you have to be awake at night.
  Fine-tune: police officer, doctor, nurse, security guard, lawyer, teacher, firefighter, construction, actor, cop
  Prefix-tuning: police officer, nurse, construction, doctor, security guard, bartender, waiter, babysitter, firefighter, teacher
  Autoprompt: construction, work, carpenter, firefighter, truck driver, roofing, police, fireman, bartender, school
New: Name a job where you only have to work during the day.
  Fine-tune: nurse, teacher, police officer, doctor, bartender, construction, waitress, lawyer, waiter, mechanic
  Prefix-tuning: nurse, teacher, lawyer, doctor, bartender, waiter, mechanic, construction, sales, waitress
  Autoprompt: construction, hospital, fireman, restaurant, plumber, cook, cleaning, chef, firefighter, teaching