CASE: Commonsense-Augmented Score with an Expanded Answer Space

LLMs have demonstrated impressive zero-shot performance on NLP tasks thanks to the knowledge they acquired during training. In multiple-choice QA tasks, the LM probabilities are used as an imperfect measure of the plausibility of each answer choice. One of the major limitations of the basic score is that it treats all words as equally important. We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space. CASE addresses this limitation by assigning importance weights to individual words based on their semantic relations to other words in the input. The dynamic weighting approach outperforms basic LM scores, not only because it reduces noise from unimportant words, but also because it informs the model of implicit commonsense knowledge that may be useful for answering the question. We then also follow prior work in expanding the answer space by generating lexically-divergent answers that are conceptually similar to the choices. When combined with answer space expansion, our method outperforms strong baselines on 5 commonsense benchmarks. We further show that the two approaches are complementary and especially beneficial when using smaller LMs.


Introduction
Large language models (LLMs) have demonstrated strong few-shot and zero-shot performance across various NLP tasks, with the larger models often matching earlier fine-tuned approaches that relied on task-specific labeled data (Radford et al., 2019; Brown et al., 2020a; Touvron et al., 2023). We focus on the zero-shot setup, which assumes that the knowledge needed to perform a specific task is already present in the LLM (Petroni et al., 2019; Zhou et al., 2020; Saha et al., 2022). Zero-shot learning has been employed for tasks such as translating between unseen language pairs (Zhang et al., 2020), summarization (Brown et al., 2020a), commonsense reasoning (Shwartz et al., 2020; Klein and Nabi, 2021; Liu et al., 2022; Fang et al., 2022), and more.
In multiple-choice question answering (MCQA) tasks, zero-shot methods typically rely on the language model (LM) probabilities as a proxy for plausibility, predicting the answer choice with the highest probability conditioned on the question. The LM score is a naïve proxy for plausibility, since it confounds factors such as length, unigram frequency, and more (Holtzman et al., 2021; Niu et al., 2021). Indeed, in Figure 1, a GPT-2 based LM score incorrectly predicts that the woman hired a lawyer because she decided to run for office, rather than because she decided to sue her employer.
In this paper, we propose to address one of the major limitations of the LM score. By summing or averaging the token-level probabilities, the LM score treats all tokens as equally important. A person reading this question would likely pay attention to option A because the word "sue" is highly relevant in the context of a lawyer. This signal might be weaker in a basic LM score, where the word "sue" is conditioned on each other token in the question and the previous tokens in the answer. Furthermore, the LM might miss non-trivial connections between related words.
To address this challenge, we propose CASE: a Commonsense-Augmented Score with an Expanded Answer Space. CASE is a post-hoc dynamic weight scoring algorithm that prioritizes important words in the sentence. The importance of each individual word is determined based on its relationship with other words in ConceptNet (Speer et al., 2017). For example, ConceptNet provides the information that suing requires having a lawyer. We use the word-level importance scores to re-weight the LM probability scores. Indeed, in the second line of option A in Figure 1, the importance of the word "sue" increases the score of the entire sentence, leading to correctly predicting A as the correct answer.
We further adopt the strategy suggested by Niu et al. (2021) to expand the answer space by using a LM to generate additional answers and then mapping semantically-similar generated answers into the original space. This mitigates the LM score's sensitivity to infrequent words. Figure 1 demonstrates that a generated option C, "she wanted to sue her former employer", which is conceptually similar to A, further yields a higher probability score with our method.
We tested CASE on 5 popular commonsense MCQA datasets. CASE outperformed the broad range of strong baselines that we compared with, confirming that it is an effective method for zero-shot MCQA. We further study the impact of different model sizes, answer candidates of varying quality, and different weight assignment strategies on the performance. Our code is available on GitHub.

Plausibility Scoring
Although the plausibility score of a sentence can be easily calculated by accumulating the probability assigned by the LM to each token, this approach suffers from various statistical biases such as sensitivity to the number of tokens, subword tokenization, and word frequency (Abdou et al., 2020; Holtzman et al., 2021). To address these biases, several improvements have been proposed. With respect to the length bias, prior work normalized the score by length (Mao et al., 2019; Brown et al., 2020b), or focused on the conditional probability of the question, which, unlike the answer choices, has a fixed length (Trinh and Le, 2018; Tamborrino et al., 2020). To factor out word frequency, Holtzman et al. (2021) proposed Domain Conditional Pointwise Mutual Information (DCPMI), which normalizes the conditional probability of the answer given the question by the prior probability of the answer. This is computed as the conditional probability of the answer given a domain-specific prefix such as "The sentiment of the movie is" for sentiment analysis or "The answer is" for general QA tasks. SEQA (Niu et al., 2021) mitigates the sensitivity to word choice by generating answers using GPT-2, and selecting the answer choice most similar to the generated answers.
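As an illustration, the DCPMI normalization reduces to a subtraction in log space. The sketch below uses toy log-probabilities rather than outputs of a real LM:

```python
def seq_logprob(token_logprobs):
    # Log-probability of a sequence = sum of its token log-probabilities.
    return sum(token_logprobs)

def dc_pmi(answer_given_question, answer_given_domain_prefix):
    """DCPMI: the conditional log-probability of the answer given the
    question, normalized by its log-probability given a generic
    domain prefix such as "The answer is"."""
    return (seq_logprob(answer_given_question)
            - seq_logprob(answer_given_domain_prefix))

# Toy numbers: an answer with infrequent words has a low prior, which
# DCPMI factors out while the raw LM score does not.
score = dc_pmi([-2.0, -5.0], [-3.0, -6.0])  # -7.0 - (-9.0) = 2.0
```

An answer whose raw score is low only because its words are rare gets the same penalty in both terms, so the penalty cancels.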
Existing methods solely focus on the relationship between words in the choices and words in the question, ignoring the importance of each word for the decision. In this paper, we propose a new token-level weighting method that considers the importance of different words within the sentence based on their relationships to other words.

Knowledge-Enhanced Models
Zero-shot LM-based scoring methods implicitly reason about which answer is more likely based on the token-level probabilities. However, many tasks require multiple steps of reasoning to reach the correct answer (e.g., Mihaylov et al., 2018; Yang et al., 2018; Khot et al., 2020). A common approach is to retrieve relevant commonsense knowledge from knowledge bases (KBs) such as ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019a; Hwang et al., 2021), in order to enhance the neural model and explicate the reasoning steps (e.g., Bauer et al., 2018; Xia et al., 2019; Lin et al., 2019; Guan et al., 2019; Chen et al., 2020; Huang et al., 2021). More recent work used the COMET model (Bosselut et al., 2019; Hwang et al., 2021), a LM fine-tuned on the aforementioned KBs, to enhance models with high-coverage contextualized commonsense inferences (e.g., Majumder et al., 2020; Bosselut et al., 2021; Kim et al., 2022; Chakrabarty et al., 2022; Ravi et al., 2023).
An alternative recent approach, which doesn't rely on external KBs, prompts a LM to generate additional knowledge which is then incorporated back into the LM to make the prediction. Shwartz et al. (2020) and later Liu et al. (2022) used a LM to generate questions and answers about an MCQA instance. The answers to the questions are then incorporated into the LM-based scoring model as additional knowledge. Wei et al. (2022) proposed the popular chain-of-thought (CoT) prompting approach, in which the LM is taught through examples to generate multiple steps of reasoning followed by the answer to the question. In the zero-shot version, the LM is instructed to "think step-by-step". Finally, following concerns about the faithfulness of CoT inferences, Creswell et al. (2022) proposed to iteratively select parts of the inputs and draw inferences on them.

Method
We propose CASE, a Commonsense-Augmented Scoring method with an Expanded Answer Space.
CASE can be used for zero-shot MCQA tasks. It is based on the LM score (Section 3.1). However, rather than treating all words in the context and answers as equally important, we propose a weighted score where the conditional probability is weighed by the importance of a word. The weights are determined using a commonsense KB in order to provide information that humans might implicitly be reasoning about when answering such questions (Section 3.2). Following Niu et al. (2021), we expand the set of answer candidates by generating free-text answers, to increase the scorer's robustness to lexical variability (Section 3.3). An overview of the method is shown in Figure 2.

Basic Scoring Method
The basic scoring method directly uses the LM score, which is calculated by accumulating the conditional probabilities assigned by the LM to each token given the prefix. Given a question Q = q_1...q_{n_Q} and an answer choice A_i = a_{i,1}...a_{i,n_{A_i}}, we convert Q into a declarative statement s (see Appendix A), and define the LM score of answer choice A_i as the accumulated conditional token log-probabilities:

S_LM(A_i) = Σ_{j=1}^{n_{A_i}} log P(a_{i,j} | s_1, ..., s_{n_s}, a_{i,1}, ..., a_{i,j-1})    (1)

where n_s is the number of tokens in s.
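The accumulation loop can be sketched as follows; the `logprob` callback is a toy stand-in for a real LM's conditional token log-probability:

```python
import math

def lm_score(statement_tokens, answer_tokens, logprob):
    """Basic LM score: accumulate the conditional log-probability of each
    answer token given the statement and the preceding answer tokens.
    `logprob(prefix, token)` stands in for a real LM."""
    score = 0.0
    prefix = list(statement_tokens)
    for tok in answer_tokens:
        score += logprob(tuple(prefix), tok)
        prefix.append(tok)
    return score

# Toy LM: uniform distribution over a 4-word vocabulary.
toy_lm = lambda prefix, token: math.log(0.25)
s = lm_score(["the", "woman", "hired", "a", "lawyer", "because"],
             ["she", "sued"], toy_lm)  # 2 * log(0.25)
```

In practice the log-probabilities would come from a causal LM such as GPT-2; the loop structure is the same.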
Finally, we can determine the most plausible choice Â among the answer choices based on their corresponding scores:

Â = argmax_{A_i} S_LM(A_i)    (2)

Commonsense Augmented Scoring
The importance of individual words in the question and their contribution to choosing the correct answer varies greatly. Take for example the instance in Figure 1, taken from the COPA dataset (Gordon et al., 2012). Determining the cause of the event "The woman hired a lawyer" involves reasoning about the circumstances in which one might hire a lawyer, such as if they are suing someone. In this case, the keywords "lawyer" from the context and "sue" from the answer choice, and the semantic relation between them (i.e., suing someone requires a lawyer), support the correct prediction. To that end, CASE first identifies important keywords from the question and answer choices (Section 3.2.1). Each keyword is assigned an importance score, and the conditional probability P_A is updated by considering the importance of each token in the answer choice (Section 3.2.2).

Keyword Extraction
Given a question Q and an answer choice A, we use YAKE (Campos et al., 2018), an unsupervised automatic keyword extraction method, to extract sets of keywords Key_Q ⊂ Q and Key_A ⊂ A.
In particular, we are interested in finding the keywords from each answer choice that are important in the context of the question Q, which we denote Key_{A|Q} ⊂ Key_A. To that end, we use ConceptNet (Speer et al., 2017), a commonsense knowledge base, to find paths between terms in Key_Q and Key_A, and include in Key_{A|Q} keywords from the answer choice that are connected in ConceptNet to keywords from the question:

Key_{A|Q} = {a ∈ Key_A : ∃q ∈ Key_Q, ∃p ∈ CN(a, q)}    (3)

where p denotes a path in ConceptNet (CN) with up to k edges.
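A minimal sketch of this filtering step, using breadth-first search over a toy adjacency dict in place of real ConceptNet queries:

```python
from collections import deque

def connected_within_k(graph, start, goal, k):
    """BFS: is there a path of at most k edges between two concepts?
    `graph` is a toy adjacency dict standing in for ConceptNet."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        node, depth = frontier.popleft()
        if node == goal:
            return True
        if depth == k:
            continue
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return False

def keywords_given_question(key_a, key_q, graph, k=3):
    """Keep answer keywords connected to some question keyword within k hops."""
    return {a for a in key_a
            if any(connected_within_k(graph, a, q, k) for q in key_q)}

# Toy graph for the Figure 1 example: "sue" reaches "lawyer" via "law",
# while "office" has no connection to the question keywords.
toy_cn = {"sue": ["law"], "law": ["lawyer", "sue"], "lawyer": ["law"]}
kept = keywords_given_question({"sue", "office"}, {"lawyer"}, toy_cn)  # {"sue"}
```

The real implementation would query the ConceptNet graph (and its relation labels) instead of an in-memory dict, but the reachability test is the same.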

Weight Assigning
We assign a weight to each token a ∈ Key_{A|Q} based on the strength of its connection to keywords in Key_Q. To that end, we look at all the ConceptNet paths that connect a with keywords in Key_Q, which we denote Paths_a. We convert each path to a set of sentences by expressing each edge as a natural language sentence, based on relation templates (see Appendix B). For example, the path from "sue" to "lawyer" through "law" is expressed as S_1 = "sue is related to law" and S_2 = "lawyer is a word used in the context of law". We use the LM to score a single path P_{a;q} as follows. First, the score S(E_i) of edge E_i = (x_i, R_i, y_i) is calculated as the conditional probability of generating the second node y_i following the textual template of relation R_i instantiated with the first node x_i, such as P(law | "sue is related to"). We then use the chain rule for conditional probability to compute the score of the entire path:

S(P_{a;q}) = S(E') · Π_i S(E_i)    (4)

where E' is an artificial summary edge from x_1 to the last node of the path with the "is related to" relation, such as "sue is related to lawyer".
To get an aggregated score for a token a, we sum the scores of all paths in Paths_a:

S(Paths_a) = Σ_{P_{a;q} ∈ Paths_a} S(P_{a;q})    (5)

Finally, the weight for each token a_{i,j} in A_i is computed as follows.
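Under the assumption that the chain rule multiplies the per-edge probabilities (the original equations are only partially legible in this copy), the path scoring and aggregation can be sketched with toy numbers:

```python
import math

def path_score(edge_probs, summary_edge_prob):
    """Score one ConceptNet path by multiplying its per-edge probabilities
    (accumulated in log space for numerical stability) with the probability
    of the artificial summary edge, e.g. "sue is related to lawyer"."""
    log_total = (sum(math.log(p) for p in edge_probs)
                 + math.log(summary_edge_prob))
    return math.exp(log_total)

def token_path_score(paths):
    """Aggregate score for a token: the sum over all of its paths (Eq. 5)."""
    return sum(path_score(edges, summary) for edges, summary in paths)

# Toy probabilities for the sue -> law -> lawyer path of Figure 1:
# two edge scores plus one summary-edge score.
paths = [([0.2, 0.3], 0.4)]
w = token_path_score(paths)  # 0.2 * 0.3 * 0.4 = 0.024
```

In the actual system each probability would be the LM's likelihood of the verbalized edge, not a hand-set number.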
With the weights for each token, we can now update the LM score defined in Equation 1 to a weight-based plausibility score:

S_CAS(A_i) = Σ_{j=1}^{n_{A_i}} w_{i,j} · log P(a_{i,j} | s_1, ..., s_{n_s}, a_{i,1}, ..., a_{i,j-1})    (7)
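Since the exact form of the weighted score is not reproduced in this copy, the sketch below assumes a weight-normalized average of token log-probabilities; the probabilities and weights are toy values:

```python
import math

def weighted_lm_score(token_probs, weights):
    """Weight-based plausibility score (assumed form): a weighted average
    of the token log-probabilities, so well-connected keywords influence
    the total more than filler words."""
    total = sum(weights)
    return sum(w * math.log(p) for w, p in zip(weights, token_probs)) / total

# Up-weighting the well-supported keyword (probability 0.6) raises the
# overall score relative to uniform weighting.
uniform = weighted_lm_score([0.1, 0.6, 0.1], [1.0, 1.0, 1.0])
boosted = weighted_lm_score([0.1, 0.6, 0.1], [1.0, 2.0, 1.0])
```

This captures the qualitative behavior described in Figure 1: a highly weighted, contextually likely word such as "sue" pulls the whole answer's score up.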

Expanded Answer Space
The final addition to our model aims at reducing the LM sensitivity to the phrasing of the correct answer. For example, an infrequent word in the correct answer choice can reduce the overall probability of the choice and make the LM predict another option as more plausible (Holtzman et al., 2021). To mitigate this issue, we follow Niu et al. (2021) and expand the set of answer candidates by using a causal LM to generate open-ended answers. The idea is to allow the model to consider various phrasings of the same conceptual answer. For example, in Figure 2, the generated answer C_1 is a paraphrase of answer choice A.
We treat the generated answer choices A* the same as the original answer choices A and compute the score for each answer A*_i ∈ A* using Equation 7. To map the answer choices back into the original answer space A, we attempt to match each A*_i ∈ A* to A_i ∈ A based on two criteria: sentence similarity and keyword connections.
Sentence Similarity. We use the Sentence-Transformers package (Reimers and Gurevych, 2019) to represent the answers, and compute the cosine similarity between the representations of each generated answer in A* and each original answer in A. The similarity score between the sentence pair must be above s_sim.
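A lightweight illustration of the similarity filter, using bag-of-words cosine as a stand-in for the Sentence-Transformer embeddings used in the paper:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_enough(generated, original, s_sim=0.5):
    """Keep a generated answer only if its similarity to an original
    answer choice clears the threshold s_sim."""
    return cosine(Counter(generated.lower().split()),
                  Counter(original.lower().split())) >= s_sim

# The generated paraphrase from Figure 1 clears the threshold.
ok = similar_enough("she wanted to sue her former employer",
                    "she decided to sue her employer")
```

With real sentence embeddings the same thresholding logic applies; only the representation changes.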
Keyword Connections. We calculate the connection score between the keywords in each generated answer in A* and each original answer in A using the method introduced in Section 3.2.2. We require the connection score to be greater than 0.
A candidate can only be assigned to a group if it meets both thresholds, and we discard generated answers that are not mapped into answer choices in A. Once we have mapped generated answers to original answers, the final prediction of the model modifies Equation 2 to select the answer choice whose cluster contains the highest-scoring answer:

Â = argmax_{A_i} max_j S(A_{i,j})    (8)

where A_{i,j} is the jth answer in cluster A_i.
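The cluster-level prediction can be sketched as follows, with hypothetical scores standing in for the weighted plausibility score:

```python
def predict(clusters, score):
    """Each original answer's cluster is represented by its highest-scoring
    member (original or generated), and the cluster with the best
    representative wins."""
    best_label, best = None, float("-inf")
    for label, answers in clusters.items():
        top = max(score(a) for a in answers)
        if top > best:
            best_label, best = label, top
    return best_label

# Toy log-probability scores: the generated paraphrase in cluster A
# scores higher than either original phrasing.
toy_scores = {"sue employer": -3.0, "sue former employer": -2.0,
              "run for office": -4.5}
clusters = {"A": ["sue employer", "sue former employer"],
            "B": ["run for office"]}
pred = predict(clusters, toy_scores.get)  # "A", via its paraphrase
```

This mirrors the Figure 1 behavior: even if the original phrasing of A scores modestly, a better-phrased generated variant can carry its cluster.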
Experimental Setup

Datasets
We evaluated our method on five multiple-choice commonsense question answering datasets described below.

Baselines
We compare our proposed method with the basic LM-based scoring method described in Section 3.1, as well as more advanced LM-based scoring methods described below.
Self-talk (Shwartz et al., 2020) consists of two causal LMs. The knowledge generator LM generates clarification questions conditioned on the context and pre-defined prefixes, and their corresponding answers. The scoring LM computes the probability of each answer choice conditioned on the context and question as well as the additionally generated knowledge.

DCPMI (Holtzman et al., 2021) aims to eliminate the effect of the number of synonyms and the word frequency on the LM score by dividing the conditional probability (Eq 1) by a domain-conditional prior probability for the answer choice.
SEQA (Niu et al., 2021) uses a LM to generate a set of answer candidates. These candidates then "vote" for an original answer candidate based on their semantic similarity to each candidate, and the top-voted answer is selected as the final answer. For a fair comparison with the other models, we changed the voting model from SRoBERTa_NLI to the original SRoBERTa that was not further fine-tuned on an NLI dataset.

CDG (Bosselut et al., 2021) uses knowledge from COMET (Bosselut et al., 2019) to construct a local commonsense knowledge graph for reasoning and inference.
ArT (Wang and Zhao, 2022) consists of two steps: notes taking and reverse thinking. In the notes taking step, the LM generates templated inferences pertaining to key phrases in the context, which are later added as additional knowledge. The reverse thinking step aggregates the scores of different orders of the answer and question (e.g., "x because y" vs. "y therefore x").

Setup and Hyper-parameters
We used GPT-2 via the HuggingFace Transformers library (Wolf et al., 2020) for the scoring part, and GPT-2 XL and GPT-3 davinci-003 for the answer space expansion step. In the keyword extraction step (§3.2.1), we included ConceptNet paths with up to k = 3 edges. In the weight assigning step (§3.2.2), we set the coefficient λ to 10.
In the answer space expansion step (§3.3), we generated n_A* = 100 answers from GPT-2 and n_A* = 50 answers from GPT-3 for each question. Similarly to SEQA, we used nucleus sampling (Holtzman et al., 2021) with p = 0.9 and set a maximum length of 15 tokens for both LMs. We set the sentence similarity threshold to s_sim = 0.5 for GPT-2 XL and s_sim = 0.6 for GPT-3.
Hyper-parameter values were selected based on preliminary experiments on the training sets and were not tuned on the dev sets.

Main Results
The performance of the various scoring methods on the 5 benchmarks is presented in Table 1. For a fair comparison with the baselines, the table shows the performance when GPT-2 XL is used. We report the accuracy on the dev set. CAS stands for Commonsense-Augmented Scoring, i.e., CASE without the candidate generation step.
The performance of CAS shows that weighting leads to substantial improvements upon the simpler baselines. CAS also outperforms DCPMI, which can itself be regarded as a special weight-scoring method.
When combined with candidate generation, CASE outperforms nearly all baselines, except on the SocialIQA dataset, on which ArT and Self-talk perform better. Notably, both baselines rely on human-designed prompts to generate additional information, which might give them an advantage.
The gap in performance from SEQA, which also expands the answer space by generating candidate answers, further demonstrates the effectiveness of dynamic weighting.

Effect of the Scoring LM Size
Table 2 shows the performance of CAS, CASE and the simple baselines when using different sizes of GPT-2 models in the scoring part.
Bigger is better. Across the various methods, bigger LMs perform better than smaller LMs.

Table 2: Accuracy when using GPT-2 models of different sizes for the scoring. Takeaways: CAS consistently outperforms standard LM scoring methods, and is outperformed by CASE. For CASE, the best performance is achieved when using large GPT-2 models for scoring and, more importantly, GPT-3 for candidate generation.
Smaller LMs gain more from candidate generation. While all LMs benefit from weighting and candidate generation, smaller LMs gain bigger improvements. For example, candidate generation with GPT-3 adds 13.4 points on COPA to a GPT2-S CAS scorer, but only 8.2 points for GPT2-XL. We hypothesize that the model performance is more sensitive to the LM quality when a single sentence is considered, while expanding the answer space makes even the lower-quality LMs more robust.

Effect of the No. of Generated Candidates
Figure 3 shows the effect of the number of generated candidates on the performance, focusing on COPA.We summarize the findings below.
Generating more candidates leads to higher accuracy. When generating few (< 20) candidates, the model's performance is unstable and relatively low. This might happen because the generated answers are conceptually different from the original candidate answers, in which case they might not meet the mapping thresholds in Section 3.3 and are filtered out. This means that CASE effectively degenerates to CAS. Thus, it is important to generate a large number of candidates. This reaffirms the findings of Niu et al. (2021).
Larger models require fewer candidates.
Larger LMs generate higher quality text which is more likely to be fluent, relevant to the context, logically correct, and consistent with commonsense knowledge.Therefore, we can expect fewer candidates to be filtered out.In addition, the generated candidates may be conceptually similar and better phrased than the original choice.

Effect of the Weighting Strategy
Table 3 compares the COPA performance of different weighting strategies. Two baselines, LM_sum and LM_avg, already introduced in Section 3.1, treat all tokens equally, summing or averaging the token-level probabilities. Conversely, the static weighting strategy (SW and SWC, without or with candidate generation) assigns a static weight (1.5) to each selected key token. Finally, the dynamic weighting strategies (CAS and CASE) not only distinguish key tokens from unimportant ones but also assign different scores to each key token based on its semantic relevance to the question. The results show that while the static weighting strategy outperforms the baseline when no additional candidates are generated (SW vs. LM), these strategies perform similarly when additional candidates are generated (SWC vs. LM+c). In both cases, the static weighting strategy underperforms compared to the dynamic strategy. This result confirms that commonsense knowledge can help inform the model about the keywords that are important for the current question.

Table 3: Accuracy on the COPA dev set when using different weight-assigning methods. The methods below the dotted line expand the answer space by generating additional answer candidates. Takeaway: keyword selection improves the performance, especially when it is informed by commonsense knowledge.

Qualitative Analysis
We focus on CASE and look at the individual token scores and corresponding ConceptNet paths to better understand the model decision-making process.
Figure 4 shows an example from SCT where CASE predicted the correct answer. The word "upset" in the correct answer choice was assigned a high weight by CASE thanks to ConceptNet paths such as the one shown in the figure. Conversely, in Figure 5, CASE predicted the incorrect answer choice for another SCT example. The model focused on the word "left" due to its semantic relation to the word "drove", failing to understand that Priya drove to and not away from the restaurant.

Conclusion
We presented CASE, a novel LM-based plausibility score for zero-shot MCQA tasks. CASE uses a commonsense KB to assign importance weights to words in the input. The weighting strategy outperforms basic LM scoring methods. When combined with generating additional answer candidates, CASE outperforms the baselines on 5 popular MCQA benchmarks. We further showed that the two approaches are complementary and are especially beneficial when using smaller LMs. In the future, we plan to explore a more selective approach for knowledge retrieval from the KB, and adapt CASE for additional NLP tasks.

Limitations
Computational complexity.CASE is more computationally expensive than using a basic LM score, as it involves finding relevant paths from an external knowledge base and then estimating their likelihood with a LM, in order to gauge the importance of keywords.
Concept coverage. The weight assignment strategy in CASE is based on ConceptNet. The knowledge in KBs such as ConceptNet is not contextualized, which means that some facts pertaining to concepts in the instance might not be relevant to the specific context. In addition, it has limited coverage. COMET (Hwang et al., 2021) has been used in prior work (Majumder et al., 2020; Chakrabarty et al., 2020; Ravi et al., 2023) to overcome this limitation. However, finding relevant paths using COMET requires an iterative multi-hop reasoning approach (Arabshahi et al., 2021), which is more complex and more computationally intensive. We aim to explore efficient ways to achieve this in future work.
Answer format. Since our method assigns a weight to each word in the input, it is only applicable to MCQA tasks in which the answer is a sentence. The weighting would be trivial for tasks with single-word answers such as CommonsenseQA (Talmor et al., 2019) and BoolQ (Clark et al., 2019).
Performance limit. Our model demonstrates a significant performance improvement over other zero-shot baselines across a majority of datasets. However, it is worth noting that the state-of-the-art performance on the datasets in this paper is achieved with more supervision (i.e., supervised or few-shot models).

Ethics Statement
Data. All the datasets and knowledge bases used in this work are publicly available. We used ConceptNet as a source of commonsense knowledge.
Since ConceptNet was crowdsourced, some of the knowledge may contain societal biases or prejudices held by the annotators (Mehrabi et al., 2021).
Models. The GPT-2 models are publicly accessible via HuggingFace, while GPT-3 is a closed model behind an API. All language models may generate offensive statements if prompted with specific inputs; however, our model only generates text internally, while the end output is a choice between human-written answer candidates.

Figure 1: An example from COPA. A and B are the original options, while option C was generated by GPT-2 as part of the answer space expansion step. The top line in each heatmap represents the LM (cross-entropy) score and the bottom line represents our CASE score. Higher scores and blue blocks correspond to lower plausibility. CASE correctly predicts option A (and option C, which is an expansion of A) as more plausible than option B, while the LM score incorrectly predicts option B.

Figure 2: Overview of CASE, illustrated with an example from the COPA dataset. Groups A and B correspond to original choices A and B and any generated answers mapped to them (§3.3). Each word in each answer is scored based on its ConceptNet relationships to other words in the instance (§3.2). The score for each answer is based on the word probabilities (§3.1), weighted by the word-level scores. Finally, CASE predicts the answer choice with the highest-scoring answer in its group.

Figure 3: Accuracy curve of CASE on the COPA dev set, with different numbers of candidates generated from various LMs. The dotted line represents the baseline method LM_sum, which uses GPT-2 XL. Takeaways: generating more candidates leads to higher accuracy, but larger scoring LMs require fewer candidates.

Figure 4: An SCT example, along with the correct answer predicted by CASE, and an example ConceptNet path that increased the weight of the important word "upset".

Figure 5: An incorrectly-predicted SCT example, along with the incorrect answer predicted by CASE, and an example ConceptNet path that increased the weight of the word "left".

Table 1: Accuracy (%) of the various scoring methods on the dev sets. All scoring methods are based on GPT-2 XL. CASE_GPT2 and CASE_GPT3 denote CASE with candidate generation by GPT-2 XL and GPT-3, respectively. Takeaway: weighting leads to substantial improvements. When combined with candidate generation, it outperforms all baselines by a large margin.
