A Systematic Investigation of Commonsense Knowledge in Large Language Models

Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge — a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs’ ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors that are not related to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models or few-shot evaluation is insufficient to achieve human-level commonsense performance.


Introduction
Common sense, the implicit knowledge about everyday situations that is shared by humans, is an important prerequisite for developing general-purpose intelligent systems (McCarthy et al., 1960; Liu and Singh, 2004; Gunning, 2018). Intriguingly, recent large language models (LMs; Brown et al., 2020; Patwary et al., 2021; Rae et al., 2021) have achieved remarkable performance on various commonsense benchmarks (e.g., Sakaguchi et al., 2020; Zellers et al., 2019a; Bisk et al., 2020b; Sap et al., 2019b), even when they are evaluated in a zero-shot or few-shot fashion, without explicit commonsense supervision. We revisit this apparent success and conduct a rigorous study to better understand the extent to which such pre-trained LMs are able to capture commonsense knowledge.
Figure 1: The experiment settings with their corresponding input to the LM. The example is taken from Social IQa (Sap et al., 2019b), where we convert questions to natural text using the rules of Shwartz et al. (2020); this conversion yields better performance (§5).
In this work, we focus on zero- and few-shot evaluations of pre-trained LMs without commonsense-specific fine-tuning, for two reasons. First, we aim to examine whether a pre-trained LM is able to acquire general commonsense knowledge. As pre-trained LMs constitute a foundational building block of NLP today, any deficiencies in their commonsense understanding can adversely manifest in downstream applications (Bommasani et al., 2021). Fine-tuning the LM would make it hard to disentangle how much of the commonsense knowledge is acquired by the underlying LM, as opposed to the task-specific supervision from a benchmark (Yogatama et al., 2019). Second, human-annotated commonsense datasets are expensive to collect due to the vast, diverse, and growing nature of commonsense knowledge (Elazar et al., 2021).
Concretely, our work differs from prior work on commonsense evaluation of LMs (Brown et al., 2020; Patwary et al., 2021) by way of a more rigorous evaluation, in which we: (i) carefully control for the LM's ability to exploit potential surface cues and annotation artefacts to predict the answer without reasoning over the context; and (ii) account for variations in factors influencing the LM's performance that arise from certain evaluation design choices, independently of the commonsense knowledge in the models. We systematically conduct this study on four commonsense benchmarks, six model sizes (up to a very large LM with 280B parameters), and multiple evaluation settings (e.g., different score functions and prompt formats).
We begin with our first question: when evaluating a large LM in a zero-shot setting, how does its zero-shot performance compare to a strong baseline (§3)? Controlling for the LM's ability to guess the correct answer without even looking at the question (Poliak et al., 2018; Trichelair et al., 2019; the Answer-only baseline, top of Fig. 1), we find that, despite the LM's strong zero-shot performance, the Answer-only baseline can nevertheless perform surprisingly well on some benchmarks. Despite the clear importance of comparing with answer-only baselines, as shown in Figure 2, these comparisons are absent from recent work on large LMs (Zhou et al., 2020; Brown et al., 2020; Rae et al., 2021). Furthermore, increasing model size alone is unlikely to bridge the gap with human performance in the near future: our analysis of scaling behavior suggests that much larger dense LMs (with 100T to 10^18 parameters, which are infeasibly large at present) are needed to achieve human performance on 3 out of 4 benchmarks.

Does familiarizing the LM with the task format using a few-shot evaluation setting substantially improve performance (§4)? We find that few-shot evaluation (using up to 64 examples) does not substantially improve the LMs' performance for most tasks, with the exception of Social IQa. Moreover, the few-shot/in-context demonstration setting fails to bridge the gap between the LM and the current SOTA.
Finally, we ask: to what extent does the model's zero-shot performance vary depending on certain evaluation design choices, such as the format of the prompt or the score function (§5)? We find that these design choices, though they have little to do with common sense, can result in large fluctuations in performance (up to 19%). This finding challenges the notion that large LMs work well out of the box with minimal task-specific tuning. Based on these findings, we emphasize the need to carefully select such design choices, explicitly state them to enable fair comparison with prior work, and quantify the robustness of the observed results across different design choices.
All in all, our findings suggest that acquiring human-level commonsense knowledge, without relying on surface cues or task-specific supervision, remains an open challenge. Given the marginal improvements from increasing model size, we conjecture that other techniques, such as explicit commonsense supervision, multi-modal grounding, or physical embodiment (Bisk et al., 2020a), are promising ways forward.

Experimental Setting
We begin by outlining our experimental setup, and describe the benchmarks, model, baselines, and other relevant experimental settings.

Commonsense Benchmarks
Commonsense knowledge spans many categories, such as physical common sense (e.g., a car is heavier than an apple), social common sense (e.g., a person will feel happy after receiving gifts), and temporal common sense (e.g., cooking an egg takes less time than baking a cake). Given this diverse nature of commonsense knowledge, various benchmarks have been proposed to test these different types of knowledge (e.g., Zellers et al., 2019a; Sakaguchi et al., 2020; Sap et al., 2019b; Bisk et al., 2020b; Lin et al., 2020; Boratko et al., 2020).
Commonsense benchmarks broadly consist of two task types: (a) multiple-choice evaluation (Zellers et al., 2018, 2019a; Sap et al., 2019b; Bisk et al., 2020b), where a model needs to choose the correct answer from a list of plausible answers; and (b) generative evaluation (Boratko et al., 2020; Lin et al., 2020, 2021), which requires a model to generate an answer given a question and some additional context. Here we focus on multiple-choice benchmarks, since they provide a more reliable automatic metric (i.e., accuracy), whereas automated metrics used to evaluate language generation (e.g., BLEU; Papineni et al., 2002) do not correlate perfectly with human judgment (Liu et al., 2016; Novikova et al., 2017).

We use a diverse set of four representative multiple-choice commonsense benchmarks to better understand the extent to which pre-trained LMs are able to acquire different types of commonsense knowledge. We use the validation split of each benchmark, as their test splits are not public.

HellaSwag (Zellers et al., 2019a) is designed to evaluate a model's ability to understand physical, grounded, and temporal common sense. Given a four-sentence story, the model must choose the correct ending from four candidates. The stories are either video captions from ActivityNet (Heilbron et al., 2015) or WikiHow passages (Koupaee and Wang, 2018). When evaluating LMs on a similar dataset (Zellers et al., 2018), incorrect answers can be easy to distinguish from correct ones; hence, in constructing HellaSwag, Zellers et al. (2019a) removed easy negatives through adversarial filtering.
WinoGrande (Sakaguchi et al., 2020) is a coreference resolution benchmark that mainly examines physical and social common sense. Each example consists of a sentence (e.g., "The trophy did not fit the suitcase because it is too big.") and two candidate entities (e.g., "trophy" or "suitcase"). The task is to choose the correct entity for the pronoun; e.g., "it" refers to "trophy" in the example.

Social IQa (Sap et al., 2019b) focuses on evaluating social common sense, in particular theory of mind, the capacity to reason about others' mental states (Flavell, 2004). Given context sentences and a corresponding question, the task is to choose the correct response from three candidates. Annotators use the ATOMIC knowledge base (Sap et al., 2019a) to create the context sentences and questions; the answers are provided by additional annotators.

PIQA (Bisk et al., 2020b), short for physical interaction question answering, mainly covers the physical aspect of common sense. Each data point consists of a task and two alternative solutions, one of which is correct. The tasks are curated from a website with instructions for everyday tasks (https://www.instructables.com/), e.g., separating egg yolks from eggs; the solutions are provided by human annotators.

Pre-trained Language Model
We use the pre-trained language model of Rae et al. (2021), Gopher, an autoregressive Transformer (Vaswani et al., 2017) language model with 280 billion parameters. We choose Gopher because of its excellent zero-shot and few-shot performance on various benchmarks, in addition to its large model size, which has been shown to improve language modeling and downstream performance (Kaplan et al., 2020). Notably, Gopher is more than 50% larger than GPT3 and, as of March 2022, is one of the largest dense LMs developed to date.
Gopher hyper-parameters. The pre-trained Gopher language model has 80 layers, 128 attention heads, 128-dimensional key/value vectors, and a feedforward layer dimension of 16,384. To better understand the effect of different model sizes (§3.2), we experiment with five other model sizes: 44M, 117M, 417M, 1.4B, and 7.1B. Like Gopher, each of these models was pre-trained by Rae et al. (2021); a full list of model hyper-parameters is summarized in Table 1 of Rae et al. (2021). Each model is trained by subsampling from the MassiveText dataset, which consists of more than 2 trillion tokens from various domains, including web pages, news, books, and code (Rae et al., 2021). The authors removed documents that overlap significantly with the evaluation sets, including the benchmarks used in our work, from the training set. We use TPUv3 to conduct all evaluations, with an estimated total compute budget of 2 × 10^20 FLOPs.

Score function. On the multiple-choice benchmarks, we evaluate the pre-trained LM by calculating the score of each answer choice under the model and selecting the highest-scoring answer:

ŷ = argmax_{y ∈ Y(x)} s_θ(y|x),

where x denotes the question or prompt, Y(x) the set of answer choices for a given question, and s_θ(·) the score of an answer choice y given x, under the pre-trained LM with parameters θ. We provide some examples in Table 2. For Social IQa, we convert questions to natural text using the rules of Shwartz et al. (2020); we find this natural-text format to yield better results, as discussed in §5.
Unless otherwise stated, we use cross-entropy (i.e., the length-normalized, token-level log probability) to score each answer:

s_θ(y|x) = (1/∥y∥) Σ_{i=1}^{∥y∥} log p_θ(y_i | x, y_{<i}),   (1)

where y_i is the i-th token of the answer y and ∥y∥ its length in tokens. This score function reduces the impact of length; without dividing by ∥y∥, longer answers would tend to have lower probabilities (Stahlberg and Byrne, 2019). GPT3 (Brown et al., 2020) also employs this score function for zero-shot evaluation.
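As a concrete illustration, the selection rule above amounts to taking the mean per-token log probability of each candidate answer and picking the argmax. The sketch below is our own, with made-up per-token log probabilities standing in for real model output:

```python
def cross_entropy_score(answer_logprobs):
    """Length-normalized log probability: the mean per-token log prob (Eq. 1)."""
    return sum(answer_logprobs) / len(answer_logprobs)

def pick_answer(choices):
    """choices maps each answer text to the per-token log probs the LM assigns
    it, conditioned on the prompt; return the highest-scoring answer."""
    return max(choices, key=lambda y: cross_entropy_score(choices[y]))

# Hypothetical per-token log probs (illustrative values, not real model output):
choices = {
    "She gets the dog wet.": [-2.0, -1.5, -0.5, -0.8, -0.2],  # mean = -1.0
    "A plane lands nearby.": [-4.0, -3.5, -3.0],              # mean = -3.5
}
best = pick_answer(choices)  # "She gets the dog wet."
```

Note that normalizing by length lets a five-token answer compete fairly with a three-token one; the sum of log probabilities alone would systematically favor shorter strings.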

Table 2: Examples of the prompt x and the correct answer y in different benchmarks (the answer y is shown in brackets).

- HellaSwag: A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. [She gets the dog wet, then it runs away again.]
- WinoGrande: The GPS and map helped me navigate home. I got lost when the [GPS got turned off.]
- Social IQa: Jordan was in charge of taking the food on the camping trip and left all the food at home. Jordan felt [horrible that he let his friends down on the camping trip.]
- PIQA: Make Halloween lanterns. [Draw ghost faces on empty milk bottles, put a candle in each one.]

Baselines
We compare the performance of Gopher with two baselines. The first, simple baseline is to randomly select an answer candidate, where the chance of selecting the correct one is 1/(number of choices); we henceforth refer to this as the Random baseline. We experimented with two other baselines: choosing the majority label from the training data, and choosing the longest answer. We omit these baselines as they perform similarly to the Random baseline.
More importantly, we consider an Answer-only baseline, where we select the highest-scoring answer choice under the LM without conditioning on the question. More formally, this baseline considers s_θ(y), as opposed to s_θ(y|x) in Eq. 1. This baseline reveals the extent to which the pre-trained LM conducts the appropriate reasoning over the context to select the answer, as opposed to relying on potential surface cues or annotation artefacts that make the correct answer a priori more probable than the rest. We illustrate this baseline at the top of Fig. 1. For WinoGrande, we calculate the cross-entropy of the text starting at the pronoun replacement, as shown in Table 2. Ideally, each answer choice should be equally likely when the question is not considered, and the Answer-only performance should be close to the Random baseline. Similar hypothesis-only baselines are well-studied for natural language inference datasets (Poliak et al., 2018); Trichelair et al. (2019) further explored such an Answer-only baseline, albeit only on the SWAG benchmark (Zellers et al., 2018).
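The contrast between zero-shot scoring s_θ(y|x) and the Answer-only baseline s_θ(y) can be sketched as follows. The log probabilities are invented for illustration: "suitcase" is made a priori more probable, standing in for an annotation artefact:

```python
def mean_logprob(logprobs):
    """Length-normalized log probability, as in Eq. 1."""
    return sum(logprobs) / len(logprobs)

# Per-token log probs for each choice, with (s(y|x)) and without (s(y)) the
# question in the prefix. Toy values, not real model output.
conditional   = {"trophy": [-0.3, -0.4], "suitcase": [-2.0, -2.1]}
unconditional = {"trophy": [-3.0, -3.1], "suitcase": [-1.0, -1.2]}

zero_shot   = max(conditional,   key=lambda y: mean_logprob(conditional[y]))
answer_only = max(unconditional, key=lambda y: mean_logprob(unconditional[y]))
# If the answer-only pick matches the gold label about as often as the
# zero-shot pick does, the benchmark likely contains answer-only artefacts.
```

Here the two rules disagree: conditioning on the question selects "trophy", while the unconditioned baseline selects the a priori frequent "suitcase".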

Zero-shot Evaluation

Zero-shot performance. At first glance, we observe strong zero-shot results, outperforming the Random baseline on all benchmarks (compare "Rand" and "ZS" in Fig. 2). However, the gap between the stronger Answer-only baseline and the zero-shot result is smaller for all benchmarks (compare "Answer" and "ZS"): whereas this gap is still sizable for HellaSwag and WinoGrande (>20 points), it is much smaller for Social IQa and PIQA. Finally, in all cases, there is still a large gap between the SOTA and zero-shot performance (>10 points); this gap is largest for WinoGrande and Social IQa, suggesting that social and physical common sense is challenging for pre-trained LMs, even a large one with 280B parameters, without task-specific supervision.

Answer-only bias
As shown in Fig. 3, the performance gap between the Random and Answer-only baselines is notably large for HellaSwag and PIQA, where the Answer-only baseline outperforms the Random baseline by more than 32% and 23%, respectively. This large gap highlights an answer-only bias in these benchmarks: the correct answer can, in fact, be selected by the LM without conducting the appropriate commonsense reasoning over the provided context. On the other hand, the Answer-only baseline performs similarly to the Random baseline on WinoGrande and Social IQa; hence, the zero-shot performance on these benchmarks is a more reliable estimate of the model's acquisition of commonsense knowledge. Given the existing (and sometimes inevitable) answer-only biases in some benchmarks, it is important to contextualize zero-shot results by comparing with strong baselines, although such comparisons are missing from recent work (e.g., Zhou et al., 2020; Brown et al., 2020; Rae et al., 2021).

Does Increasing Model Size Help?
Gopher (the largest LM we have access to) achieves decent zero-shot performance on most commonsense benchmarks, but maintains a notable gap with fine-tuned SOTA results. Can we eventually reach human-level performance on these commonsense benchmarks by increasing model size alone? Since we do not have access to language models larger than Gopher, we examine the extent to which zero-shot performance improves when using Gopher compared to a range of smaller models (i.e., scaling plots). Such scaling plots can help us predict the performance of models larger than Gopher. To that end, we use six pre-trained model sizes from 44M to 280B parameters (see §2.2); each model size is trained on the same dataset, hence any performance differences can be attributed to model size.

We present the findings in Table 3. On all four benchmarks, the LM's zero-shot performance (Table 3, ZS column) consistently improves as we use increasingly larger models. This finding is consistent with that of Brown et al. (2020), who showed that larger models perform better on HellaSwag, WinoGrande, and PIQA. But, crucially, we argue that this does not necessarily mean that larger models are better at commonsense reasoning: for HellaSwag and PIQA, the Answer-only baseline also substantially improves with model size (Table 3, Answer column). Hence, for these benchmarks, larger models are also better at exploiting potential surface cues and annotation artefacts to guess the correct answer without reasoning over the context. To properly assess commonsense reasoning, we should focus on the performance difference between the zero-shot and Answer-only results. We plot this difference with respect to model size in Fig. 4.
We observe that larger models perform better across benchmarks: when increasing model size, the zero-shot performance gains exceed the performance gains of the Answer-only baseline. Nevertheless, the magnitude of this improvement varies by benchmark: we see a substantial improvement on WinoGrande, but smaller improvements on HellaSwag, Social IQa, and PIQA.
Scaling behavior. Based on these trends, what model size would be required to achieve human-level performance on these benchmarks? Through a linear regression analysis (see Appendix B for details), given the current rate of improvement when gradually increasing the model size from 44M up to 280B, we would need a model of at least 1.4T parameters to achieve human performance on HellaSwag, and a model of >100T parameters (∼400x larger than Gopher) for the other benchmarks. This result suggests that training ever-larger models may not help us reach human performance, at least in the near future. Indeed, given the enormous compute costs of training LMs even larger than the 280B-parameter Gopher model, we conjecture that there are more efficient ways of acquiring commonsense knowledge in an unsupervised fashion, for instance through multi-modal learning and grounding (Bisk et al., 2020a).
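The extrapolation described above can be sketched as a least-squares fit of accuracy against log10(parameter count), solved for the size at which the fit reaches human performance. The accuracy values below are hypothetical placeholders, not the paper's numbers (those are in Table 3 and Appendix B):

```python
import math

sizes = [44e6, 117e6, 417e6, 1.4e9, 7.1e9, 280e9]       # model sizes (Sec. 2.2)
accs  = [0.52, 0.55, 0.58, 0.62, 0.66, 0.70]            # hypothetical accuracies

# Ordinary least squares for acc ~ slope * log10(params) + intercept.
xs = [math.log10(s) for s in sizes]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(accs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Solve slope * log10(p) + intercept = human_acc for the parameter count p.
human_acc = 0.94
params_needed = 10 ** ((human_acc - intercept) / slope)
```

Even with these optimistic toy numbers, the extrapolated size lands several orders of magnitude beyond 280B, which is the qualitative point of the analysis.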

Few-shot Performance
Recent work has shown that large LMs can perform surprisingly well on various tasks in a few-shot fashion (Brown et al., 2020; Patwary et al., 2021). Under this setup, the model is provided with n examples of the downstream task, which are appended to the prompt prefix. Concretely, for the four commonsense benchmarks, we append n examples that include the question and the correct answer; these examples, which are randomly sampled from the training split of each benchmark, appear before the evaluated question, as shown in Fig. 1. This few-shot formulation is appealing, as it relies only on a small number of task-specific examples to get the LM accustomed to the task, without any fine-tuning. To what extent can we improve model performance on commonsense benchmarks by shifting from the zero-shot to the few-shot evaluation protocol?

In Fig. 5, we compare the performance of Gopher under different evaluation protocols: (i) zero-shot and (ii) few-shot (n), where we use n ∈ {1, 10, 64} examples. We run the few-shot experiments between 5 and 10 times, sampling different examples each time, and report the average performance. The variance across runs is very small and is shown as the error bars in Fig. 5.
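A minimal sketch of this prompt construction follows; the helper, its formatting, and the example pairs are our own illustration, not the paper's exact template:

```python
import random

def build_few_shot_prompt(train_examples, eval_question, n, seed=0):
    """Sample n solved (question, answer) pairs from the training split and
    place them before the evaluated question; each answer choice is then
    scored as a continuation of this prefix."""
    rng = random.Random(seed)             # fixed seed for a reproducible run
    shots = rng.sample(train_examples, n)
    parts = [f"{q} {a}" for q, a in shots]
    parts.append(eval_question)           # the question under evaluation
    return "\n\n".join(parts)

# Hypothetical training examples for illustration:
train = [
    ("The trophy did not fit the suitcase because the", "trophy is too big."),
    ("Jordan left all the food at home. Jordan felt", "horrible."),
    ("Make Halloween lanterns.", "Draw ghost faces on empty milk bottles."),
]
prompt = build_few_shot_prompt(train, "She gets the dog wet, then", n=2)
```

In a real experiment the sampling would be repeated with different seeds (5 to 10 runs in our setup) and the accuracy averaged across runs.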
Interestingly, model performance with few-shot (1) is sometimes worse than the zero-shot model, but the few-shot (10) and few-shot (64) models outperform their zero-shot counterpart (albeit sometimes by small margins). On HellaSwag and PIQA, we do not observe substantial improvement from few-shot evaluation compared to the zero-shot baseline (less than 2%). While few-shot evaluation does not help much for most datasets, the one exception is Social IQa, where the few-shot (64) model outperforms the zero-shot model by a margin of more than 7%. We attribute this to the less natural text of Social IQa; adding task-specific examples provides information about what is expected of the task.
Overall, we observe that the usefulness of the few-shot setting is benchmark-dependent. Moreover, using task-specific examples in a few-shot setting does not bridge the gap to SOTA or human performance on any of the benchmarks.
Knowledge base retrieval. We further examine whether adding pre-extracted commonsense knowledge-base triplets to the context, as a different form of few-shot/in-context learning, helps improve model performance (see Appendix D for details). In contrast to the work of Shwartz and Choi (2020), we observe no improvements when appending the triplets; we attribute this discrepancy to the strong performance of our base models (see §5).

Robustness of Reported Results
Different evaluation design choices, such as the format of the prompt or the choice of score function, can impact the LM's zero-shot performance and, crucially, result in different conclusions about a model's commonsense understanding ability. Moreover, the lack of a standardized zero-shot LM evaluation protocol makes direct comparisons between papers difficult (Shwartz et al., 2020; Bosselut et al., 2021). To what extent can we attribute variance in the reported results to these evaluation design choices, even though they have little to do with commonsense knowledge?
Model. Quantifying the robustness of the reported results requires scoring a large number of examples under different evaluation design choices, which is infeasible with the largest (280B-parameter) model due to its slow inference speed. Hence, we conduct the following experiments using the 7.1B-parameter model, which is still ∼5 times larger than GPT2 (Radford et al., 2019).

Score functions.
Prior work employs different score functions to assess the plausibility of each answer choice given a question (Brown et al., 2020; Shwartz et al., 2020; Bosselut et al., 2021; Holtzman et al., 2021), which makes a direct comparison between different results challenging. Here we investigate the impact of different score functions on the reported performance. In addition to cross-entropy (defined in §2.2), we experiment with two other score functions. The first is sequence log probability, the (unnormalized) log probability of the answer choice y conditional on the question x. Letting y_i be the i-th token of the answer y:

s_θ(y|x) = Σ_{i=1}^{∥y∥} log p_θ(y_i | x, y_{<i}).

Another widely used score function (Bosselut et al., 2021; Holtzman et al., 2021) is point-wise mutual information (PMI):

s_θ(y|x) = log p_θ(y|x) − log p_θ(y).

This score function takes into account both the probability of the answer choice alone and its probability conditional on the question: it assesses whether the question adds information, as commonsense reasoning should be established within the context of the question. As PMI accounts for the prior probability of the answer options, it can yield lower accuracy than score functions like cross-entropy that do not account for this factor (Answer-only baseline, §2.3).
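To see how PMI can differ from raw sequence log probability, consider a toy case (invented log probabilities) where one choice is a priori frequent: sequence log probability prefers it, while PMI rewards the choice whose probability rises most once the question is given:

```python
def seq_logprob(logprobs):
    """Sequence log probability: summed per-token log probs, log p(y|x)."""
    return sum(logprobs)

def pmi_score(cond_logprobs, uncond_logprobs):
    """Point-wise mutual information: log p(y|x) - log p(y)."""
    return sum(cond_logprobs) - sum(uncond_logprobs)

# Toy per-token log probs with and without the question in the prefix:
conditional   = {"frequent answer": [-1.0, -1.0], "apt answer": [-1.5, -1.4]}
unconditional = {"frequent answer": [-1.2, -1.1], "apt answer": [-3.0, -3.0]}

by_seq = max(conditional, key=lambda y: seq_logprob(conditional[y]))
by_pmi = max(conditional, key=lambda y: pmi_score(conditional[y],
                                                  unconditional[y]))
```

Here "frequent answer" wins under sequence log probability (-2.0 vs. -2.9), but the question barely raises its probability (PMI 0.3), whereas "apt answer" gains much more from the context (PMI 3.1) and wins under PMI.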
Prompt format. Another important factor is the format of the prompt; here we consider a few such choices. In addition to the concatenation of the question and the answer, we experiment with adding special symbols "[Question]" and "[Answer]" to demarcate the question and the answer (Brown et al., 2020). Moreover, for Social IQa and PIQA, we experiment with a set of predefined rules (taken from Shwartz et al., 2020) to convert the questions into sentences, which are closer to the LM's pre-training data format. Finally, we find that having the correct lower/upper case and punctuation is important; thus we manually checked all benchmarks to correct for case and punctuation.

Scored text. The next option is whether to score the entire question-answer pair (Shwartz et al., 2020), or only the answer choice (conditional on the given question as prefix), as done by Brown et al. (2020); i.e., whether to calculate s(x; y) or s(y|x), where ; denotes text concatenation.

Do These Design Choices Matter?
Table 4 shows the performance difference between using the worst versus the best design choices, which are independently optimized for each task. To sweep over the above design choices, instead of considering all combinations of parameters, we iterate over the options in one category (e.g., score function) while fixing the parameters in the other categories. Overall, we observe a difference between the best and worst settings on all benchmarks; this gap is especially large for HellaSwag and PIQA. This result shows that large language models do not simply work out of the box for some commonsense benchmarks: for some tasks, these evaluation design choices account for a large variation in model performance. We find that the score function plays the most important role: cross-entropy yields the highest accuracy on most benchmarks, but sequence log probability achieves slightly better performance on WinoGrande. However, when using these scores, we should account for the Answer-only baseline (§3). Moreover, converting questions to sentences makes the largest difference for Social IQa. We also find that scoring the answer conditional on the question, as opposed to scoring the concatenation of question and answer, works best, except for WinoGrande, which has no questions. Our findings also indicate that the score function choice, which is not covered by lightweight fine-tuning approaches that learn a prefix to maximize performance (e.g., Li and Liang, 2021), is more important than the prompt format (§5.1); here we focus on evaluation setups with no parameter updates, and leave such extensions to future work.
(Sweeping one category at a time saves compute resources while offering a lower bound on the performance variation; our goal here is not to seek the highest achievable performance, but to understand how much performance varies across different settings.)

Answer-length bias. Although cross-entropy generally achieves the best reported performance, this score function is sensitive to answer length.
As shown in Appendix C, cross-entropy tends to assign higher scores to longer answers; to a varying extent, this pattern holds for PIQA, Social IQa, and WinoGrande. We attribute this to the higher probability assigned to later tokens in the sequence: such tokens have the most context and can thus be predicted more easily than tokens at the beginning of the answer. As longer answers contain more of these easier-to-predict tokens, their cross-entropy tends to be lower. This pattern is reversed for metrics such as sequence log probability, where shorter sequences often have higher scores (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019). Note that this bias does not change the results reported in this work, since there is no correlation between answer length and correctness (Appendix C).
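The two length biases pull in opposite directions, which can be seen with two toy answers (invented log probabilities): summed log probability penalizes every extra token, while the length-normalized score rewards an answer whose later tokens are easy to predict:

```python
def seq_logprob(logprobs):
    """Summed log probability: every extra token lowers the score."""
    return sum(logprobs)

def mean_logprob(logprobs):
    """Length-normalized (cross-entropy-style) score, as in Eq. 1."""
    return sum(logprobs) / len(logprobs)

# Hypothetical per-token log probs:
short_answer = [-2.0, -2.0]                    # 2 tokens, uniformly uncertain
long_answer  = [-3.0, -1.0, -0.3, -0.2, -0.1]  # later tokens easy to predict

# Summed: short wins (-4.0 > -4.6). Normalized: long wins (-0.92 > -2.0).
```

Neither behavior is neutral with respect to length, which is why we check for (and find no) correlation between answer length and correctness in Appendix C.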
Takeaways.We conclude this section with three concrete recommendations for future work.
• Although cross-entropy often achieves the best performance, it does not take into account the probability of selecting the correct answer without reasoning over the context (§3). We recommend that future work either: (i) use cross-entropy and report the gap with the Answer-only baseline, or (ii) use the PMI score function, which already takes the prior probability of the answer into account.
• In the same way that we search for the best model hyper-parameters, future work should search over certain important evaluation design choices, such as the format of the prompt, and whether to convert the questions into declarative sentences.
• Lastly, we strongly encourage future work to report the variance of the observed results across different design choices.This can provide an indication of the robustness of the language models' performance on commonsense benchmarks.

Related Work
While recent work evaluates LMs on commonsense benchmarks in a zero- and few-shot fashion, it does not examine the extent to which model performance can be attributed to superficial cues or annotation artefacts in a given dataset (e.g., through strong baselines), nor does it quantify how robust model performance is under different evaluation design choices. Trichelair et al. (2019) and Elazar et al. (2021) investigate dataset bias in commonsense coreference resolution benchmarks (Levesque et al., 2012; Sakaguchi et al., 2020) and SWAG (Zellers et al., 2018); here we conduct a more comprehensive investigation on four diverse commonsense benchmarks. Another line of work probes for commonsense knowledge in LMs through knowledge base completion (Petroni et al., 2019; Davison et al., 2019) or manually designed probing tasks (Weir et al., 2020; Shwartz and Choi, 2020). Zhou et al. (2020) evaluate pre-trained LMs on commonsense benchmarks and propose a new dataset requiring multi-hop reasoning. In contrast, we focus on zero- and few-shot evaluation of commonsense understanding using existing benchmarks.

Conclusion
We conduct a systematic and rigorous study of large LM performance on a diverse set of commonsense benchmarks, in a zero-shot and few-shot fashion. While pre-trained LMs can seemingly achieve good zero-shot performance on these benchmarks, these results can be partially attributed to the LMs' ability to exploit potential surface cues and annotation artefacts to guess the correct answer without reasoning over the provided context. We further observed that substantially increasing model size yields rather small improvements on most commonsense benchmarks: based on the scaling plots, achieving human-level performance requires much larger model sizes than what is currently feasible. In addition, model performance can be highly sensitive to certain evaluation design choices. Overall, our findings offer valuable insights and best practices for rigorously evaluating large LMs.

Ethical Considerations
The primary aim of this paper is to conduct a systematic and rigorous commonsense evaluation of a large language model, which, in the case of this work, is achieved by using the pre-trained Gopher language model (Rae et al., 2021) with 280B parameters. Hence, the risks stemming from large language model research broadly also apply to this work (Bender et al., 2021). We briefly discuss these ethical considerations below.
Training compute. In practice, pre-training large language models like Gopher requires an enormous amount of compute, which may contribute to increased carbon emissions (Strubell et al., 2019; Patterson et al., 2021). In this work, we do not pre-train the language model from scratch, although we acknowledge that conducting inference and evaluation with large language models like Gopher still has substantial computational costs. Given the need to construct even larger language models (>100 trillion parameters) to achieve human-level performance on most of these benchmarks in an unsupervised fashion (§3.2), we encourage future work to focus on potentially more efficient ways of acquiring commonsense knowledge directly from data, e.g., through multi-modal learning, grounding, and human interaction (Bisk et al., 2020a).
Fairness and bias. Given the enormous size of the pre-training data, about 2 trillion tokens in the case of Gopher pre-training, it is conceivable that the training dataset may inadvertently contain toxic and biased material. Such toxic material, which is not always easily identifiable in the large training dataset, can in turn encourage the model to produce biased, harmful, or toxic output, especially when prompted with toxic text (Gehman et al., 2020). In fact, Rae et al. (2021) demonstrated that, up to a certain model size, larger language models may respond to toxic prompts with greater toxicity compared to smaller ones. Furthermore, the enormous size of the training data does not necessarily guarantee diversity: we expect the training data to contain a smaller proportion of the vernacular or regional English that is used by underrepresented communities (Blodgett et al., 2016; Bender et al., 2021). Moreover, the language model may also acquire harmful biases and stereotypes, e.g., assign lower probabilities to women becoming doctors as opposed to men (Rudinger et al., 2018; Cao and Daumé III, 2021).
Language model misuse. Our work highlights both the success and the limitations of large language models at multiple commonsense benchmarks. Nevertheless, the success and expressive power of large language models come at the expense of potential misuse. Given their ability to generate realistic-looking, albeit not necessarily factual, content, large language models can also be used for malicious purposes. For instance, large language models can be used to generate convincing fake news (Zellers et al., 2019b), and more powerful generators can in turn produce even more convincing and influential fake news. Given the difficulty of manually distinguishing between human-generated and machine-generated text (Clark et al., 2021), determining how we can better detect and defend against malicious use of large language models is an important and exciting avenue for future work.

Limitations
There are limitations to this work. First, we only assessed models' performance on multiple-choice questions (and not in a generative setting). Multiple-choice problems have a more reliable automatic metric; in contrast, metrics used for generative tasks do not always accurately reflect human judgment (Clark et al., 2021). Second, we only evaluate the benchmarks on one family of models, the Gopher models and their variants; given the computational cost and the limited availability of different large language models (LLMs), we cannot run our experiments on model families other than Gopher. However, we include zero-shot results on commonsense benchmarks from existing work on other LLMs (such as the GPT2 result in Table 7). Moreover, LLMs behave very similarly on various benchmarks, and we expect our results to generalize to other LLMs as well. Last but not least, we only evaluate models that are trained solely on language. Recent multimodal models have shown impressive performance on a range of tasks (Saharia et al., 2022). Will models trained on multiple modalities have more common sense? We aim to answer this question in future work.
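As noted above, multiple-choice evaluation admits a simple and reliable automatic metric: accuracy when the model picks the lowest-cross-entropy choice. A minimal sketch of that metric is below; `cross_entropy` is a hypothetical LM scoring callable (an assumption for illustration, not any specific library's API).

```python
def evaluate_multiple_choice(examples, cross_entropy):
    """Accuracy of a zero-shot LM on multiple-choice questions.

    `examples` is an iterable of (question, choices, gold_index) tuples;
    the prediction for each question is the choice whose cross-entropy
    conditional on the question is lowest.
    """
    correct = 0
    total = 0
    for question, choices, gold in examples:
        pred = min(range(len(choices)),
                   key=lambda i: cross_entropy(question, choices[i]))
        correct += int(pred == gold)
        total += 1
    return correct / total
```

Any scorer with the signature `cross_entropy(question, choice) -> float` can be plugged in, which makes the metric easy to apply uniformly across model sizes.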
C Cross-entropy vs answer length for all datasets

D Commonsense Knowledge Bases
Given the implicit nature of commonsense knowledge, a language model's pre-training corpora might not contain all of the supporting evidence required to answer commonsense understanding questions, a phenomenon widely known as the reporting bias problem (Gordon and Van Durme, 2013). Thus, prior work has proposed using external knowledge bases to improve the zero-shot performance of LMs on commonsense benchmarks (Bosselut et al., 2021; Bauer and Bansal, 2021). These approaches are particularly interesting, as the knowledge base augmentation only happens at test time, rendering them compatible with any pre-trained generative LM. While prior work has shown the effectiveness of this approach over a zero-shot baseline that lacks access to commonsense knowledge bases (CSKBs), we find that the performance of the baseline model is highly sensitive to certain evaluation design choices (§5). A natural question, therefore, is the following: if we carefully optimize the evaluation design choices of the baseline model, would we still observe similar improvements through CSKB augmentation?
Setup. To answer this, we replicate prior work by adding commonsense knowledge base entries at test time; such knowledge base triplets can potentially provide the relevant implicit commonsense knowledge that makes the correct answer more likely than the rest. To ensure the generality of our findings, we apply this approach to the multiple model sizes that we explored in §3.2. Here we consider the pre-extracted knowledge base triplets that are made publicly available by Shwartz et al. (2020).
We use a similar score function as Shwartz et al. (2020), where, for each answer choice $y \in Y(x)$, we choose the knowledge base triplet that yields the highest score:

$$s_{\mathrm{kg}}(y \mid x) \;\triangleq\; \sum_{t \in T} s(y; t \mid x) \;\approx\; \max_{t \in T} s(y; t \mid x),$$

where $s(y; t \mid x)$ denotes the cross-entropy of the concatenated answer choice $y$ and the extracted knowledge base triplet $t$, conditional on the question/context $x$. Here $T$ denotes the set of all extracted commonsense knowledge triplets, which are generated from Comet (Bosselut et al., 2019).
One key difference is that we score the answer and knowledge base triplet conditional on the question, whereas Shwartz et al. (2020) scored the concatenation of question, answer, and triplet instead.
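The triplet-selection rule above can be sketched as follows. Here `logprob` stands in for the LM score (higher is better, i.e., a negative cross-entropy) and is a hypothetical callable, not any specific library's API; the toy score table is purely illustrative.

```python
def kb_augmented_score(question, answer, triplets, logprob):
    # s_kg(y|x) ~ max_t s(y; t|x): keep only the single best triplet,
    # scoring the concatenated answer and triplet conditional on the question.
    return max(logprob(question, f"{answer} {t}") for t in triplets)

def pick_answer(question, answers, triplets, logprob):
    # The predicted answer is the choice with the highest augmented score.
    return max(answers,
               key=lambda y: kb_augmented_score(question, y, triplets, logprob))

# Toy score table standing in for a real LM (illustration only).
toy_scores = {
    ("Q", "yes t1"): -2.0, ("Q", "yes t2"): -1.0,
    ("Q", "no t1"):  -3.0, ("Q", "no t2"):  -2.5,
}
toy_logprob = lambda x, cont: toy_scores[(x, cont)]
```

Note that the `max` over triplets is taken per answer choice, so a single well-matched triplet suffices to boost a candidate answer.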
In Table 7, we summarize our results on Social IQa, which has the highest gap between the zero-shot and SOTA performance (Fig. 2). We compare our results with those of Shwartz et al. (2020), who used GPT2 as the base model. Our results in Table 7 provide an interesting contrast to the findings of Shwartz et al. (2020): our baseline zero-shot model with 1.3B parameters achieves an accuracy of 47.0% on Social IQa, substantially outperforming the reported GPT2 result of Shwartz et al. (2020), which achieves 41.1%, despite the fact that GPT2 has more parameters (1.5B vs our 1.3B). In fact, the same 1.3B zero-shot model, which does not benefit from any commonsense knowledge base triplets, nearly matches the performance of GPT2 augmented with Comet (Bosselut et al., 2019) (47.0% for our zero-shot 1.3B model vs 47.5% for GPT2 augmented with Comet; Table 7), and also outperforms the GPT2 model that is augmented with self-talk. Nevertheless, we find that adding knowledge base triplets fails to yield substantial improvements for our models; this finding is consistent across three different knowledge bases and five model sizes. On the contrary, adding such knowledge base triplets can occasionally decrease performance compared to the zero-shot baseline.
We remark on two significant aspects of our findings. First, it is important to compare proposed improvements against strong, well-tuned baselines (Henderson et al., 2018; Melis et al., 2018), which can achieve surprisingly competitive performance. We identify the choice of the scored span as a particularly important design choice: whereas Shwartz et al. (2020) scored the GPT2 model on the concatenation of both question and answer, we instead calculate the cross-entropy of the answer given the question. Second, certain improvements that are observed under a particular set of evaluation design choices may not necessarily be replicated under a different set. This finding reiterates the importance of explicitly stating the evaluation design choices used in each experiment, and of identifying whether or not the observed improvements are robust across different evaluation design choices (§5).
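The scored-span design choice can be made concrete with a small sketch. `toy_avg_logprob` is a deliberately simplistic stand-in for an LM's average per-token log probability (our assumption, purely for illustration); the point is only that the two spans can rank the same answers differently.

```python
def score_full_concat(question, answer, avg_logprob):
    # Shwartz et al. (2020)-style: score the whole "question + answer" string.
    return avg_logprob("", f"{question} {answer}")

def score_answer_only(question, answer, avg_logprob):
    # Our choice: score only the answer, conditional on the question.
    return avg_logprob(question, answer)

def toy_avg_logprob(context, continuation):
    # Toy scorer: tokens already seen in the context are "cheap" (-0.1),
    # unseen tokens are "expensive" (-1.0); averaged over tokens.
    seen = set(context.split())
    toks = continuation.split()
    return sum(-0.1 if t in seen else -1.0 for t in toks) / len(toks)
```

Under this toy scorer, the two candidate continuations of "Alex poured water" tie when the full concatenation is scored, while conditioning on the question prefers the answer that reuses context words, so the two design choices need not agree.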

Figure 4: The difference between zero-shot performance and the Answer-only baseline for different model sizes.

Figure 5: Accuracy on the benchmarks for zero-shot (ZS) and few-shot (FS) settings (with 1, 10, and 64 examples). We additionally report error bars, although they are not always visible due to the very small variance.
Answer length vs cross-entropy (average log probability across tokens) for PIQA, Social IQa, HellaSWAG, and Winogrande (figure panels).

Table 1: Benchmark statistics. Choices: the number of candidate answers for each question; Questions: the number of questions in each benchmark.

Table 3: Performance of all models across benchmarks under different experimental settings. Ans: Answer-only baseline; ZS: zero-shot performance; FS(n): few-shot performance, where n is the number of examples.

Table 4: The performance difference between the worst and best design choices for each benchmark.

Table 7: Zero-shot performance on Social IQa when using different knowledge bases. GPT2 results are taken from Shwartz et al. (2020). ZS: zero-shot performance; CN: ConceptNet. We do not include the Gopher results (280B parameters) due to computational considerations and much slower inference.