CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering

The task of zero-shot commonsense question answering evaluates models on their capacity to reason about general scenarios beyond those presented in specific datasets. Existing approaches for tackling this task leverage external knowledge from CommonSense Knowledge Bases (CSKBs) by pretraining the model on synthetic QA pairs constructed from CSKBs, in which negative examples (distractors) are formulated by randomly sampling from CSKBs using fairly primitive keyword constraints. However, two bottlenecks limit these approaches: the inherent incompleteness of CSKBs limits the semantic coverage of synthetic QA pairs, and the lack of human annotations makes the sampled negative examples potentially uninformative and contradictory. To tackle these limitations, we propose Conceptualization-Augmented Reasoner (CAR), a zero-shot commonsense question-answering framework that fully leverages the power of conceptualization. Specifically, CAR abstracts a commonsense knowledge triple to many higher-level instances, which increases the coverage of the CSKB and expands the ground-truth answer space, reducing the likelihood of selecting false-negative distractors. Extensive experiments demonstrate that CAR generalizes to questions about zero-shot commonsense scenarios more robustly than existing methods, including large language models such as GPT3.5 and ChatGPT. Our code, data, and model checkpoints are available at https://github.com/HKUST-KnowComp/CAR.


Introduction
Pre-trained Language Models (PLMs; Devlin et al., 2019; Clark et al., 2020) fine-tuned on task-specific training sets achieve remarkable near-human performance on held-out test sets, yet struggle to generalize to examples that are distributionally different from their training sets (McCoy et al., 2019; Ma et al., 2019; Zhou et al., 2021; Wang et al., 2021). This discrepancy arises because fine-tuned PLMs often rely on spurious, dataset-specific correlations to learn a task rather than learning to fully leverage the implicit commonsense knowledge required for reasoning (Branco et al., 2021). For reasoning systems to be effective, though, they must be robust across domains and generalize beyond the specificities of individual datasets.

[Figure 1: An example of synthesizing a QA pair from a CSKB triple (Ma et al., 2021). The simple heuristic used in this process can result in false-negative options.]
To confront the generalization issue in commonsense reasoning tasks, zero-shot commonsense Question-Answering (QA) requires models to answer questions from evaluation benchmarks without access to the corresponding training data (Shwartz et al., 2020; Li et al., 2020). Among the methods that tackle this task, the most performant ones inject commonsense knowledge from CSKBs (Hwang et al., 2021; Jiang et al., 2021) into PLMs by fine-tuning them on synthetic QA pairs transformed from commonsense knowledge triples, where the head and relation are transformed into a question and the tail serves as the ground answer. Negative examples are randomly sampled with keyword-overlap constraints (Ma et al., 2021). Such knowledge injection benefits not only QA tasks that are derived from CSKBs, such as SocialIQA (Sap et al., 2019b), which is derived from ATOMIC (Sap et al., 2019a), but also QA datasets in other domains (Bisk et al., 2020).

[Figure 2: An example of conceptualization inference over the triple (PersonX played a football game, xWant, take a rest), whose head can be abstracted via (played a football game, IsA, Sport), (played a football game, IsA, Tiring event), and (played a football game, IsA, Exercise). More abstract knowledge, such as (Do sport, xWant, take a rest), can be obtained through conceptualization.]
Despite recent advancements in this area, two major challenges remain. First, manually curated CSKBs, such as ATOMIC, are incomplete (Kuo and Hsu, 2010). While consolidating multiple CSKBs can improve coverage, it remains infeasible to cover all conceivable knowledge for the vast range of entities and situations in the real world (He et al., 2022). Automatic methods for expanding CSKBs exist, such as knowledge base completion (Li et al., 2016; Malaviya et al., 2020) and knowledge distillation from large language models (West et al., 2022; Gao et al., 2023), but they either fail to provide knowledge about novel entities or provide only highly accurate yet less informative knowledge (e.g., vague adjectives, such as happy, as situation descriptors). Second, in zero-shot commonsense QA, negative examples are required for models to learn to distinguish the validity of commonsense scenarios (Chen et al., 2023a). However, existing negative QA examples are synthesized using simple heuristic-based negative sampling without considering deeper semantics, resulting in too many false-negative options. For instance, in Figure 1, "have a drink" is also plausible in the context of "after playing a football game." Questions that label plausible options as negative instances confuse the model during training, impeding its ability to discern correct commonsense knowledge.
We tackle both of these challenges by utilizing conceptualization. As Murphy (2004) posits, humans rely on conceptual induction to draw inferences about unseen situations without the need to memorize specific knowledge. Conceptualization (He et al., 2022) offers a similar capability by abstracting a set of instances into concepts, which allows for the derivation of abstract commonsense knowledge associated with each concept that can be instantiated to assist reasoning in specific downstream situations. For example, in Figure 2, "play a football game" can be conceptualized as a tiring event, which can be further generalized into abstract knowledge. The benefits of conceptualization are twofold. First, conceptualized commonsense knowledge introduces abstract knowledge through one-step concept inference over the original CSKB, enhancing knowledge coverage. Second, as the abstract knowledge is conditioned on the original knowledge, the recall of knowledge regarding the same head is increased, leading to more fine-grained constraints for negative option sampling.
Inspired by these advantages, we propose CAR (Conceptualization-Augmented Reasoner), a simple yet effective zero-shot commonsense QA framework that leverages conceptualization to expand existing CSKBs and reduce false-negative distractors. We first augment the original CSKB with conceptualization to infuse abstract commonsense knowledge and improve knowledge coverage. Then, we propose a conceptualization-constrained sampling strategy that generates distractors with concept-level constraints to prevent false-negative options (Section 4). Experimental results on five popular commonsense QA benchmarks demonstrate the effectiveness of CAR, which even surpasses GPT3.5 and ChatGPT (Section 5). In Section 6, we analyze why CAR works by providing human evaluations that show a significant reduction of false-negative options compared to other methods. Finally, our analysis reveals that conceptualization-augmented training examples tend to be more ambiguous (Swayamdipta et al., 2020) than those produced by prior heuristics, leading to better out-of-domain generalization.

Related Works
Zero-shot Commonsense QA. Zero-shot commonsense QA evaluates a model's reasoning generalizability on unseen QA entries without any supervision signals from the corresponding annotated training data. To tackle this task, two primary pipelines have emerged in existing works. The first paradigm employs off-the-shelf language models without changing their parameters, either using vanilla language modeling with prompts (Trinh and Le, 2018; Li et al., 2022) or with inference-time mechanisms specifically designed for reasoning, such as self-talk (Shwartz et al., 2020), cloze translation (Dou and Peng, 2022), and dynamic generation of reasoning sub-graphs with graph reasoning (Bosselut et al., 2021). The second pipeline leverages external CSKBs as knowledge sources to provide PLMs with additional supervision signals for further fine-tuning (Banerjee and Baral, 2020; Ma et al., 2021; Su et al., 2022). A common strategy involves converting knowledge triples in CSKBs into synthetic QA pairs by transforming the head and relation into a question, the tail into a gold answer, and (randomly) sampling tails from other heads as distractors. Such a fine-tuning paradigm benefits from incorporating CSKBs from different domains (Kim et al., 2022; Shi et al., 2023) and from exploiting multi-hop graph structures with graph neural networks (Guan et al., 2023); it heightens the model's commonsense sensitivity in a QA context, leading to state-of-the-art performance.
Conceptualization. Conceptualization refers to the process of abstracting a group of instances or events into a general concept (Song et al., 2011, 2015). In commonsense reasoning, it simulates conceptual induction (Murphy, 2004) and enables the derivation of abstract commonsense knowledge under the specific contextualization of the original commonsense knowledge (Tenenbaum et al., 2011), which is often lacking in existing CSKBs. Among many existing works studying conceptualization (Durme et al., 2009; Gong et al., 2016; Liu et al., 2022; Peng et al., 2022), He et al. (2022) investigate it at the level of event semantics and construct AbstractATOMIC, an event conceptualization benchmark and knowledge base based on ATOMIC (Sap et al., 2019a). Recently, Wang et al. (2023a) propose to conceptualize CSKBs at scale with semi-supervised learning and demonstrate that abstract knowledge can enhance commonsense inference modeling (Bosselut et al., 2019; Da et al., 2021). With current works mostly investigating the problem of conceptualization itself, none of them have extrinsically evaluated the impact of conceptualization on downstream tasks, such as commonsense QA (Talmor et al., 2019) or machine reading comprehension (Nguyen et al., 2016).

Data Augmentation. Data augmentation aims at generating new examples from existing data to expand the size and diversity of a training set without requiring costly data annotations (Wei and Zou, 2019). Various methods have been proposed to augment textual data, including those using random perturbation (Wei and Zou, 2019), text embeddings (Wang and Yang, 2015), lexical semantics (Niu and Bansal, 2018), back translation (Sennrich et al., 2016), and large language models (West et al., 2022; Ismayilzada and Bosselut, 2023; Gao et al., 2023) for CSKB construction. Nevertheless, text-perturbation-based augmentations do not provide new knowledge to CSKBs, and knowledge mining from large language models suffers from high typicality (e.g., favoring simple commonsense over informative yet rare commonsense) and low density, still making negative sampling subject to false negatives (Malaviya et al., 2020).

Definitions
Conceptualization. Formally, denote a CSKB as D = {(h, r, t) | h ∈ H, r ∈ R, t ∈ T}, where H, R, and T are the sets of heads, relations, and tails in the original CSKB. Following He et al. (2022), the conceptualized CSKB, conditioned on D, can be denoted as D_C = {(h_c, r, t) | h_c ∈ H_c, r ∈ R, t ∈ T}, where H_c is the set of conceptualized head events. Specifically, each conceptualized head h_c is obtained by replacing a component i ∈ h with its abstract concept c while ensuring that the formed (h_c, r, t) triple is still plausible in the original context (r, t). Such (h_c, r, t) triples are commonly referred to as abstract commonsense knowledge.
Zero-shot Commonsense QA. In this paper, we employ the zero-shot commonsense QA task proposed by Ma et al. (2021) to study our framework. First, the CSKB D is transformed into multiple (Q_i, A_i) pairs, where Q_i is a natural language question and A_i = {A_{i,1}, A_{i,2}, ..., A_{i,m}} is a set of options with m candidates. Specifically, for a given knowledge triple (h, r, t) ∈ D, we convert h, r into Q_i via natural language templates and use t as the ground-truth answer. Additionally, we retrieve m − 1 distractors from other triples sampled from D using a manually defined strategy, such as keyword-overlap filtering. The objective of our task is to train a QA model on the synthetic QA set {(Q_i, A_i)}. Once trained, the model is tested on held-out entries (Q_test, A_test) from QA benchmarks. This requires the model to perform zero-shot commonsense reasoning, since the training data of the target benchmarks are unavailable to the model.
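As a concrete illustration, the triple-to-QA conversion with keyword-overlap distractor filtering might be sketched as follows. The templates, stopword list, and toy triples are hypothetical stand-ins for illustration, not the actual resources used by Ma et al. (2021):

```python
import random

# Hypothetical relation templates; the actual templates follow Ma et al. (2021).
TEMPLATES = {
    "xWant": "{head}, as a result, PersonX wants to",
    "xEffect": "{head}, as a result, PersonX",
}

def keywords(event):
    """Crude keyword extractor: non-stopword tokens of the event text."""
    stop = {"personx", "persony", "a", "an", "the", "to", "is", "at"}
    return {w for w in event.lower().split() if w not in stop}

def triple_to_qa(triple, cskb, m=3, seed=0):
    """Turn (h, r, t) into a question with one gold answer and m - 1
    distractors sampled from triples whose heads share no keywords with h."""
    h, r, t = triple
    question = TEMPLATES[r].format(head=h)
    rng = random.Random(seed)
    pool = [t2 for (h2, r2, t2) in cskb
            if keywords(h2).isdisjoint(keywords(h)) and t2 != t]
    distractors = rng.sample(pool, m - 1)
    options = distractors + [t]
    rng.shuffle(options)
    return question, options, options.index(t)

# Toy CSKB with three triples (hypothetical).
cskb = [
    ("PersonX arrives at the bar", "xWant", "relax himself"),
    ("PersonX finishes an exam", "xWant", "check the answers"),
    ("PersonX waters the garden", "xWant", "put the hose away"),
]
q, options, gold = triple_to_qa(cskb[0], cskb)
print(q)              # "PersonX arrives at the bar, as a result, PersonX wants to"
print(options[gold])  # "relax himself"
```

The keyword-disjointness test here is exactly the "fairly primitive" constraint the paper criticizes: it would happily accept a semantically plausible tail as a distractor as long as the surface keywords differ.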

Dataset
We use ATOMIC (Sap et al., 2019a) as the source CSKB D. ATOMIC contains inferential commonsense knowledge, in the format of (h, r, t) triples, associated with commonly seen events. Specifically, the heads of ATOMIC triples are events, whereas the tail nodes are either events or attributes. For conceptualization, we use the human-annotated abstract knowledge from AbstractATOMIC (He et al., 2022) to train a generative conceptualizer for acquiring D_C. More details on conceptualization and the statistics of AbstractATOMIC are provided in Section 4.1 and Appendix B.1.

Evaluation Benchmarks
Following Ma et al. (2021), we evaluate our framework on five commonsense QA benchmarks: Abductive NLI (aNLI), CommonsenseQA (CSQA), PhysicalIQA (PIQA), SocialIQA (SIQA), and WinoGrande (WG). Detailed descriptions of these benchmarks are provided in Appendix A.

CAR Framework
This section introduces our proposed CAR framework. A general sketch is presented in Figure 3. Our framework can be summarized into three steps: (1) Conduct one-step conceptualization inference on existing triples in the CSKB to obtain abstract commonsense knowledge triples.
(2) Transfer the triples into QA pairs and generate distractors using keywords and conceptualizations as constraints.
(3) Train the QA model using marginal ranking loss.

Conceptualization Augmentation
To incorporate abstract knowledge into the CSKB, we begin by augmenting the triples (h, r, t) ∈ D through a one-step conceptualization inference. Initially, given a head event h, we retrieve all plausible conceptualizations C_h = {c_{i_1,1}, c_{i_1,2}, ...} for all identified instances i ∈ {i_1, i_2, ... | i ∈ h}, using entity-linking heuristics to retrieve concepts from Probase (Wu et al., 2012) and WordNet (Miller, 1995). A conceptualized head event h_c is then obtained by replacing an instance i ∈ h with one of its retrieved conceptualizations c ∈ {c_{i,1}, c_{i,2}, ...}. This is done for all identified instances and their retrieved conceptualizations, thereby constructing the set of conceptualized head events of h. Subsequently, we attach the non-abstract counterpart (r, t) to each h_c to generate candidate abstract knowledge triples (h_c, r, t), and adopt a discriminator trained with a semi-supervised conceptualization-instantiation framework (Wang et al., 2023a) to determine their plausibility. Only plausible triples are kept to form D_C. Details about the conceptualization retrieval process and the discriminator are presented in Appendix B.1.
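A minimal sketch of this augmentation step, with a toy concept lexicon and a stub plausibility scorer standing in for the Probase/WordNet retrieval and the semi-supervised discriminator of Wang et al. (2023a) (all names and values here are illustrative assumptions):

```python
# Hypothetical instance -> concepts mapping, standing in for Probase/WordNet retrieval.
CONCEPTS = {
    "football game": ["sport", "tiring event", "exercise"],
}

def score_plausibility(triple):
    """Stub for the discriminator; returns a plausibility score in [0, 1].
    Here it accepts everything, purely for illustration."""
    return 0.95

def conceptualize(triple, threshold=0.9):
    """One-step conceptualization: replace each identified instance in the
    head with each of its concepts, keeping only plausible (h_c, r, t)."""
    h, r, t = triple
    abstract = []
    for instance, concepts in CONCEPTS.items():
        if instance not in h:
            continue
        for c in concepts:
            h_c = h.replace(instance, c)            # form the conceptualized head
            cand = (h_c, r, t)
            if score_plausibility(cand) >= threshold:  # keep plausible triples only
                abstract.append(cand)
    return abstract

aug = conceptualize(("PersonX played a football game", "xWant", "take a rest"))
print(aug[0])  # ('PersonX played a sport', 'xWant', 'take a rest')
```

Each surviving (h_c, r, t) then joins D_C alongside the original triple, mirroring the Figure 2 example in which "play a football game" generalizes to "do sport."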

Concept-Constrained QA Synthesis
To synthesize a commonsense triple (h, r, t) into a (Q_i, A_i) pair, we first transfer h, r into Q_i using natural language templates and set t as the ground-truth answer A_1. For example, the triple in Figure 3 becomes "PersonX arrives at the bar, as a result, PersonX wants to," with "relax himself" as the answer. To sample distractors, we impose concept-level constraints in addition to keyword overlap: only a triple whose head shares no keywords with h, and whose instances share no conceptualizations with those in h, can be sampled as a distractor candidate. This constraint requires that the two triples have no common keywords and that their instances cannot be abstracted into the same conceptualization. For example, in Figure 3, "(PersonX is at the casino, xWant, have a drink)" cannot be used as a distractor triple because "casino" can be conceptualized as "entertainment place," which is the same as "bar" in the original triple. Finally, we sample two distractor triples for the triple (h, r, t) and use their tails as the distractors. To guarantee that the abstract commonsense knowledge from our previous augmentation is learnable by the QA model, we synthesize both the original triple (h, r, t) and its conceptualized versions (h_c, r, t) into QA pairs.
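The combined constraint can be sketched as a simple predicate over two toy lexicons that mirror the casino/bar example above (the keyword and concept mappings are hypothetical, not CAR's actual resources):

```python
# Toy lexicons (hypothetical values mirroring the Figure 3 example).
KW = {
    "PersonX arrives at the bar": {"bar"},
    "PersonX is at the casino": {"casino"},
    "PersonX waters the garden": {"garden"},
}
CONCEPT = {
    "PersonX arrives at the bar": {"entertainment place"},
    "PersonX is at the casino": {"entertainment place"},
    "PersonX waters the garden": {"outdoor activity"},
}

def valid_distractor(head, cand):
    """A candidate head qualifies only if it shares no keyword AND no
    conceptualization with the query head."""
    return KW[head].isdisjoint(KW[cand]) and CONCEPT[head].isdisjoint(CONCEPT[cand])

# "casino" is keyword-disjoint from "bar" but shares the concept
# "entertainment place", so the concept constraint rejects it.
print(valid_distractor("PersonX arrives at the bar", "PersonX is at the casino"))  # False
print(valid_distractor("PersonX arrives at the bar", "PersonX waters the garden"))  # True
```

Note that plain keyword filtering alone would have accepted the casino triple, producing exactly the kind of false-negative distractor CAR is designed to avoid.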

Model Training
Following Ma et al. (2021), we train our QA model by fine-tuning a pre-trained Masked Language Model (MLM) using the Marginal Ranking (MR) loss. Let C represent the original context (if any), Q the question, and (A_1, A_2, ...) the list of options. We first concatenate C, Q, and an answer option A_i via natural language prompts to generate input sequences (T_1, T_2, ...).
For example, the synthesized question with its correct answer in Figure 3 is transformed into: "PersonX arrives at the bar, as a result, PersonX wants to relax himself." We then mask out one token at a time and calculate the masked loss. The final MLM score for an input sequence T ∈ {T_1, T_2, ...} with n tokens is:

S = -(1/n) Σ_{i=1}^{n} log P(t_i | T \ {t_i}),    (1)

where t_i is the i-th token of T and T \ {t_i} denotes T with t_i masked out. After calculating the scores S_1, S_2, ... for all answer candidates A_1, A_2, ..., we compute the marginal ranking loss as in Equation 2, where η represents the margin and y is the index of the correct answer:

L = (1/(m-1)) Σ_{i ≠ y} max(0, η - S_i + S_y).    (2)
During the evaluation phase, we use the same scoring procedure to assign a score to each option and select the one whose concatenated sentence achieves the lowest score as the model's prediction.
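The scoring and loss computation above can be sketched in plain Python; the per-token log-probabilities are hypothetical stand-ins for actual MLM outputs, and only the arithmetic of the score and the marginal ranking loss is shown:

```python
def mlm_score(token_logprobs):
    """Masked-LM score of a sequence: the average negative log-probability
    of each token when it is masked in turn (lower = more plausible)."""
    n = len(token_logprobs)
    return -sum(token_logprobs) / n

def marginal_ranking_loss(scores, gold, margin=1.0):
    """L = 1/(m-1) * sum over i != gold of max(0, margin - S_i + S_gold),
    pushing the gold score below every distractor score by the margin."""
    m = len(scores)
    return sum(max(0.0, margin - scores[i] + scores[gold])
               for i in range(m) if i != gold) / (m - 1)

# Toy per-token log-probs for three candidate sequences (hypothetical numbers).
seqs = [[-0.2, -0.4, -0.3], [-0.5, -1.0, -0.9], [-2.0, -2.0, -2.0]]
scores = [mlm_score(s) for s in seqs]       # approx. [0.3, 0.8, 2.0]
pred = min(range(len(scores)), key=scores.__getitem__)
print(pred)  # 0 (the lowest score wins at evaluation time)
print(round(marginal_ranking_loss(scores, gold=0), 3))
```

Here only the second candidate falls inside the margin of the gold answer, so it alone contributes to the loss; the third is already separated by more than η and contributes zero.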

Setup
Baselines. First, we use random voting (Random) and most-frequent labeling (Majority) to demonstrate the characteristics of each benchmark. Vanilla RoBERTa-Large (Liu et al., 2019) and DeBERTa-v3-Large (He et al., 2023) PLMs are used to demonstrate the power of fine-tuning. The performances of these two models under a supervised training regime are also included to show the upper bound of our results. We also include the results of several existing approaches that tackle the same task, including Self-talk (Shwartz et al., 2020), COMET-DynaGen (Bosselut et al., 2021), SMLM (Banerjee and Baral, 2020), MICO (Su et al., 2022), and the previous state-of-the-art STL-Adapter (Kim et al., 2022). Most importantly, we compare our framework with Ma et al. (2021) to validate the efficacy of conceptualization, since both methods share similar model architectures and training procedures. Both RoBERTa-Large and DeBERTa-v3-Large are used as backbones for fair comparisons. In total, 534,833 synthetic QA pairs are provided by Ma et al. (2021).
With the recent advances in Large Language Models (LLMs) (Bang et al., 2023; Chan et al., 2023; Qin et al., 2023), we also benchmark the performances of GPT3.5 (Brown et al., 2020) and ChatGPT (OpenAI, 2022) by prompting the LLMs directly in a zero-shot setting, where no in-context learning (Min et al., 2022) or chain-of-thought reasoning (Wei et al., 2022) is applied.

[Table 1: Zero-shot evaluation results. △ marks models trained following Ma et al. (2021) on ATOMIC. ATM_C stands for ATOMIC with abstract commonsense knowledge injected. ATM-10X stands for using ATOMIC-10X (West et al., 2022) as the source CSKB D. All baseline results are consistent with their original papers.]
For every QA entry, the LLM is presented with a question, several choices, and a natural language command that asks it to choose the index of the correct answer directly (Robinson et al., 2022). We then parse the generated outputs with carefully designed rules to obtain the LLM's predictions and compare them with the ground-truth labels. More details of the baselines and the LLM setups can be found in Appendices B.2 and B.3.
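The actual parsing rules are described in the paper's appendix; a minimal sketch of the idea, extracting the first valid option index from a free-form reply (the regex and fallback behavior are assumptions for illustration):

```python
import re

def parse_choice(output, num_options):
    """Extract the first option index mentioned in the model's reply;
    return None if no valid index is found (scored as incorrect)."""
    match = re.search(r"\b([1-9][0-9]?)\b", output)
    if match:
        idx = int(match.group(1))
        if 1 <= idx <= num_options:
            return idx
    return None

print(parse_choice("The correct answer is 2.", 3))   # 2
print(parse_choice("Answer: (1) relax himself", 3))  # 1
print(parse_choice("I am not sure.", 3))             # None
```

Replies with no recoverable index (the last case) cannot be matched to any option and are simply counted against the model's accuracy.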

Implementation Details
We use accuracy as the evaluation metric for all benchmarks. For conceptualization, we leverage an off-the-shelf conceptualizer from Wang et al. (2023a), a semi-supervised conceptualization discriminator fine-tuned on labeled conceptualization data from AbstractATOMIC and unlabeled data from ATOMIC. We retain conceptualizations with a plausibility score above T = 0.9, which results in 440K conceptualization-aided synthetic QA pairs for training. We employ an AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 7e-6 and a maximum sequence length of 128 to accommodate QA pairs of different lengths. We select the best checkpoint according to the highest accuracy achieved on the synthetic validation QA set. Each experiment is repeated with three different random seeds, and the average performance is reported. The model is warmed up with 5% of the total iterations and evaluated every 1,000 global steps, and the margin η for the marginal ranking loss is set to 1, in line with the choices made by Ma et al. (2021) and Kim et al. (2022). More details about the implementation can be found in Appendix B.4.

Results
The main results are reported in Table 1. Among the baselines, DeBERTa-v3-Large (MR) trained on ATOMIC achieves the best performance, followed by ChatGPT; both achieve an average accuracy of more than 70%. Our best system, based on DeBERTa-v3-Large and trained on our conceptualization-augmented ATOMIC, achieves state-of-the-art performance on average, surpassing all baselines, including ChatGPT.

Analysis and Discussion
In this section, we study the effects of conceptualization and the reasons contributing to CAR's success. First, we conduct expert evaluations on the synthetic QA pairs to study the quality and diversity of different CSKB augmentation methods in comparison with conceptualization. Second, we conduct a training dynamics (Swayamdipta et al., 2020) analysis to show that conceptualization-aided QA pairs provide more ambiguous examples that are helpful for training. Finally, we study the impact of filtering ATOMIC-10X with different critic thresholds, the ablations of CAR, and the effect of conceptualization from an out-of-domain generalization perspective in Appendix B.5, B.7, and B.8.

Comparisons With Data Augmentations
To demonstrate the effectiveness of our proposed conceptualization method, we conduct comprehensive analyses against other data augmentation methods that expand the semantic coverage of CSKBs in a similar way, using both expert and automatic evaluations. We use EDA (Wei and Zou, 2019) and augmentation with word embeddings (Word2Vec; Mikolov et al., 2013; GloVe; Pennington et al., 2014), contextual embeddings (BERT; Devlin et al., 2019), and synonyms (WordNet; Miller, 1995) as baselines. To align all the baselines for fair comparisons, we only augment the identified instance i ∈ h in each ATOMIC triple's head event h, generating as many augmentations as the number of its valid conceptualizations |C_h|. Additionally, we randomly sample the same amount of knowledge from ATOMIC-10X, which is distilled from GPT3 (Brown et al., 2020), into ATOMIC, making LLM distillation another augmentation baseline (more explanations in Appendix B.5).
We comprehensively analyze the comparison along three dimensions: diversity, quality of synthetic QA pairs, and zero-shot commonsense QA performance. Three expert annotators, all undergraduate or graduate students actively involved in commonsense research, are recruited to facilitate our evaluations. They demonstrate a high level of agreement among themselves, with an IAA of 83% in terms of pairwise agreement and a Fleiss Kappa score (McHugh, 2012) of 0.64, comparable to the 0.62 reported by Ma et al. (2021).

Diversity. First, we study whether augmentations can introduce new knowledge to the training set. We begin by calculating the average cosine similarity between each ATOMIC triple and its augmented siblings from each method according to their SentenceBERT (Reimers and Gurevych, 2019) embeddings. For ATOMIC-10X, we regard the sampled knowledge as augmentations. The complement of the average similarity across all ATOMIC triples serves as an automatically measured diversity (Div.). Meanwhile, we retrieve the top-10 similar triples from ATOMIC for each augmented triple according to their SentenceBERT similarity. The experts are asked to annotate whether each triple can be semantically covered by its retrievals. We define the expert-evaluated diversity as the ratio of uncovered triples among 300 samples. Table 2 shows that conceptualization leads on both metrics, indicating that the introduced abstract knowledge is diverse and lacking in existing CSKBs, which helps expand their knowledge coverage.
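The automatic Div. metric reduces to one minus the mean cosine similarity between each original triple and its augmentations. A sketch with toy 3-d vectors standing in for SentenceBERT embeddings (the vectors and function names are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def diversity(orig_embs, aug_embs_per_triple):
    """Div. = 1 - mean cosine similarity between each original triple's
    embedding and the embeddings of its augmented siblings."""
    sims = [cosine(o, a)
            for o, augs in zip(orig_embs, aug_embs_per_triple)
            for a in augs]
    return 1.0 - sum(sims) / len(sims)

# Toy 3-d "embeddings" standing in for SentenceBERT vectors:
# one augmentation identical to the original, one orthogonal to it.
orig = [[1.0, 0.0, 0.0]]
augs = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
print(diversity(orig, augs))  # 0.5
```

Higher Div. means the augmentations sit farther from the originals in embedding space, i.e., they plausibly add knowledge the CSKB did not already contain.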
Quality of Synthetic QA Pairs. Next, we synthesize the augmented triples into QA pairs, using their head events' keywords and augmentations as constraints. We then sample 300 QA pairs for each method and ask the same experts to annotate the correctness of each QA pair's ground-truth answer and whether the distractors could also be plausible with respect to the augmented head event. This evaluates the plausibility ratio of the augmented knowledge and the ratio of QA pairs containing false-negative distractors. Table 2 shows that the majority of augmented knowledge is implausible and fails to enhance distractor sampling. Conceptualization, on the other hand, remains highly plausible and can effectively eliminate false-negative distractors.
Expert annotators also achieve a remarkable accuracy of 86% while working on 300 randomly sampled question-answer pairs, surpassing the 80% accuracy reported by Ma et al. (2021).
Zero-shot Commonsense QA Performance. Finally, we train DeBERTa-v3-Large models on the QA pairs synthesized from the concatenation of the original and augmented ATOMIC triples from each method. Only the keywords of each head event are used as constraints. The models are trained using the marginal ranking loss, as explained in Section 4.3, and evaluated on the five QA benchmarks in a zero-shot manner. Performances of the different methods are shown in Table 2. We observe that conceptualization outperforms all other augmentations on average and successfully improves the model's zero-shot commonsense reasoning ability.
Comparison with ATOMIC-10X. Augmenting with ATOMIC-10X appears to be a promising option, as it contains a wealth of valuable commonsense knowledge. However, despite its diverse and high-quality knowledge, Table 2 demonstrates that the model cannot leverage this information effectively. One possible explanation is that the model's performance is hindered by the significantly high number of false-negative distractors. This issue arises because the knowledge distilled from GPT-3 tends to be versatile, resulting in many tail events that are general and vague. Such tails can be applied to a large collection of heads, which leads to false-negative options. More experiments and case studies are in Appendix B.5 and C, respectively.

Training Dynamics Analysis
Training dynamics effectively assess a model's confidence and variability for individual instances when training on a large dataset. In the context of QA, we define confidence as the model's certainty when assigning the correct label to the ground-truth option compared to distractors, as indicated by the logit difference. Variability, on the other hand, refers to the fluctuation of confidence over time.

[Figure 4: Training dynamics reflecting the model's behavior when different knowledge is introduced into the training set. More explanations are in Appendix B.6.]
In this section, we examine the impact of abstract commonsense knowledge (conceptualization) and GPT3-distilled knowledge (ATOMIC-10X) by exploring their training dynamics on two sets of data. We train three QA models on synthetic QA pairs from conceptualization-augmented ATOMIC, ATOMIC-10X-augmented ATOMIC, and the original ATOMIC, which serves as the baseline. First, we randomly select the same 1,000 QA pairs synthesized from the original ATOMIC and calculate their training dynamics under these three models. The left side of Figure 4 displays the alterations caused by the two augmentation methods in comparison with the baseline. Evidently, introducing abstract commonsense knowledge through conceptualization significantly reduces the model's average variability and enhances its confidence in learning the knowledge from the original ATOMIC. In contrast, incorporating knowledge from ATOMIC-10X produces the opposite effect.
Second, we check the training dynamics on 1,000 randomly sampled QA pairs synthesized from abstract commonsense knowledge and another 1,000 from knowledge in ATOMIC-10X. The rightmost plots in Figure 4 reveal that, compared to ATOMIC-10X, conceptualization introduces knowledge with higher variability and lower confidence, making it more ambiguous and challenging for the model to learn. As Swayamdipta et al. (2020) suggest, such data contribute to a model that is more robust to out-of-distribution (OOD) data, which in our case are the downstream QA datasets. Therefore, we conclude that conceptualization is superior to ATOMIC-10X: abstract knowledge, on the one hand, makes the original knowledge easier to learn, aiding optimization, and on the other hand, provides more ambiguous examples that boost OOD generalization.
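Given a per-example sequence of gold-vs-distractor logit differences recorded across epochs, confidence and variability reduce to a mean and a standard deviation. A sketch with hypothetical numbers (the logit differences are invented for illustration):

```python
import statistics

def training_dynamics(logit_diffs):
    """Per-example confidence (mean) and variability (population std) of the
    gold-vs-distractor logit difference across training epochs."""
    confidence = statistics.fmean(logit_diffs)
    variability = statistics.pstdev(logit_diffs)
    return confidence, variability

# Hypothetical logit differences over 5 epochs for two examples:
# a consistently well-separated one and an ambiguous one.
easy = [2.0, 2.2, 2.1, 2.3, 2.4]
ambiguous = [-0.5, 0.8, 0.1, 1.2, -0.2]
for diffs in (easy, ambiguous):
    c, v = training_dynamics(diffs)
    print(round(c, 2), round(v, 2))
```

The first example yields high confidence and low variability (easy-to-learn), while the second yields the low-confidence, high-variability profile that Swayamdipta et al. (2020) associate with ambiguous, generalization-promoting data.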

Impact of Training Data Size
In Figure 5, we present the influence of the number of training examples on the final performance, which reveals a clear and intuitive trend: performance correlates positively with the amount of training data.

Generalization to other CSKBs
We explore the feasibility of transferring our framework to CSKBs other than ATOMIC. We take the CWWV dataset as an example, which consolidates multiple CSKBs, including ConceptNet (Speer et al., 2017), WordNet (Miller, 1995), and Wikidata (Vrandecic and Krötzsch, 2014). We use the off-the-shelf GPT2 conceptualizer (Wang et al., 2023a) and ChatGPT as two flexible generative conceptualizers. The generated conceptualizations are then transformed into abstract knowledge and integrated into the CWWV dataset. The experimental results are presented in Table 3, which shows an improvement of over 1% compared to all baselines leveraging CWWV as the source of knowledge, indicating CAR's generalizability to other CSKBs. More details are presented in Appendix B.9.

Conclusions
In this paper, we present CAR, a pioneering framework for zero-shot commonsense QA empowered by conceptualization. Our approach surpasses even large language models on five QA benchmarks, achieving state-of-the-art performance on average. Our analyses reveal that conceptualization can improve the sampling of negative examples, and that abstract knowledge is more helpful than knowledge distilled from GPT3, as it provides more ambiguous examples to support OOD generalization. These findings demonstrate the substantial benefits of introducing conceptualization and abstract knowledge into zero-shot commonsense reasoning.

Limitations
One limitation of this paper is that the proposed CAR framework has only been validated on the ATOMIC dataset. While previous works (Ma et al., 2021; Kim et al., 2022; Dou and Peng, 2022) have studied the zero-shot commonsense question answering task by consolidating multiple CSKBs, including ATOMIC (Sap et al., 2019a), ConceptNet (Speer et al., 2017), WordNet (Miller, 1995), VisualGenome (Krishna et al., 2017), and Wikidata (Vrandecic and Krötzsch, 2014), our work only utilizes ATOMIC (more details are discussed in Appendix B.2). This is mainly due to the availability of conceptualizations for the CSKB: only AbstractATOMIC (He et al., 2022) is available as a conceptualized expansion of ATOMIC, while other CSKBs lack such resources. Additionally, ATOMIC has been shown by Ma et al. (2021) to play the most critical role in experimental results. Nonetheless, this limitation does not restrict CAR's potential to seek further improvements by incorporating other CSKBs, as conceptualization frameworks such as CAT (Wang et al., 2023a) can be applied to other CSKBs and provide the resources required for CAR to operate. Thus, we believe CAR can overcome this limitation and will continue to improve as more CSKB-associated conceptualization resources become available.

Ethics Statement
This paper presents CAR, a novel framework for zero-shot commonsense question answering that achieves state-of-the-art performance via conceptualization. All datasets used, including ATOMIC, AbstractATOMIC, and the commonsense question-answering benchmarks, are publicly available and shared via open-access licenses solely for research purposes, consistent with their intended usage. These datasets are anonymized and desensitized, ensuring that no data-privacy issues are involved. Moreover, the CAR framework is a question-answering system that selects the most plausible choice from a list of options and does not yield any private, offensive, biased, or sensitive information, nor does it touch on social or political issues. The expert annotations are performed by the authors of this paper as part of their contribution; they are graduate and undergraduate students working on machine commonsense in natural language processing, and they are fully aware of the annotation protocol and the intended use of their annotations. They are well trained with specially designed instructions and have voluntarily agreed to participate without compensation. Based on this, the authors believe that this paper does not raise any ethical concerns to the best of their knowledge.

Appendices A Benchmark Descriptions
In this section, we introduce more details regarding the five evaluation benchmarks.
Abductive NLI (aNLI) (Bhagavatula et al., 2020) is a Natural Language Inference (NLI) benchmark that aims to infer the most plausible explanation for a given causal situation. For each question, the model is required to choose the more plausible of two hypotheses that fit the beginning and end of a story. This benchmark evaluates the model's abductive reasoning ability, which typically requires commonsense reasoning.
CommonsenseQA (CSQA) (Talmor et al., 2019) is a question-answering benchmark that evaluates a broad range of commonsense aspects. Each sample contains a question and five choices. The question and some choices are generated from subgraphs of ConceptNet (Speer et al., 2017), while crowdsourcing annotators also annotate some distractors. This benchmark evaluates the model's concept-level commonsense reasoning ability.
PhysicalIQA (PIQA) (Bisk et al., 2020) is a question-answering benchmark that requires the model to select the more plausible of two possible continuations of a common scenario, where the inference requires physical commonsense. This benchmark evaluates the model's physical commonsense reasoning ability.
SocialIQA (SIQA) (Sap et al., 2019b) is a question-answering benchmark that requires reasoning about social interactions. Each sample contains a context that is derived from ATOMIC (Sap et al., 2019a), a question, and three choices. The questions are automatically generated using nine templates that correspond to the nine relations in ATOMIC, and the correct answers are crowdsourced. This benchmark evaluates the model's reasoning ability for emotional and social commonsense in daily situations.
WinoGrande (WG) (Sakaguchi et al., 2021) is a pronoun resolution benchmark. Each sample contains an emphasized pronoun and a short context description. The model is asked to choose the correct reference given two options. This benchmark evaluates the model's pronoun resolution ability, which is also part of commonsense knowledge.
In our experiments, we use the validation splits of these benchmarks, as the official testing sets may not be publicly available. Detailed statistics on the number of QA pairs and the number of options per question are reported in Table 4.

B Additional Explanations and Analyses
In this section, we cover additional details regarding the CSKB conceptualization in CAR (Appendix B.1), implementations of our system (Appendix B.4), baselines (Appendix B.2 and B.3), experiments using ATOMIC-10X (Appendix B.5), analyses (Appendix B.6, B.7, and B.8), and generalizability experiments (Appendix B.9) that are not covered in the body text due to space constraints.

B.1 Definitions and Statistics of CSKB Conceptualization
Conceptualization plays a crucial role in generalizable commonsense reasoning. Previous studies have demonstrated its potential in aiding commonsense inference modeling (Wang et al., 2023a) and commonsense knowledge graph construction (Yu et al., 2023; Zhang et al., 2022). In our paper, we follow the definition of conceptualization proposed by He et al. (2022) and Wang et al. (2023a) in conceptualizing an instance within an event to a concept: (1) Events: Each event represents a commonly observed occurrence that encompasses valuable subsequential or inferential commonsense knowledge. In AbstractATOMIC, the events are the head events of all triples in ATOMIC without a wildcard ('_').
(2) Instances: Within each event, multiple instances have been identified with semantic parsing tools, representing specific components of the event that are worthy of conceptualization.
(3) Concepts: Concepts are the conceptualizations of each instance. These concepts are extracted from Probase/WordNet and further validated by human annotators or critic filtering models. For an event e, which is the head of an ATOMIC triple, an instance refers to either an entity within the event or the complete event itself. Multiple instances can exist within a single event, denoted as i_1, i_2, i_3, ..., i_n ∈ e. A concept corresponds to the conceptualization of an instance, and multiple conceptualizations can be associated with a single instance, as in (i_1, c_{1,1}), (i_1, c_{1,2}), (i_1, c_{1,3}), ..., (i_2, c_{2,1}), ..., (i_n, c_{n,1}), ..., (i_n, c_{n,m}). For instance, consider the event "PersonX is drunk when exiting the bar." In this case, two instances can be identified: "PersonX is drunk when exiting the bar" and "bar." The conceptualization of the instance "PersonX is drunk when exiting the bar" may include "drunk" or "enjoyed," while the instance "bar" can be conceptualized as an "entertainment place" or a "fun place." In this paper, we leverage the AbstractATOMIC dataset, provided by He et al. (2022), as our primary source of conceptualizations. AbstractATOMIC is a benchmark for conceptualized commonsense knowledge built upon the ATOMIC dataset (Sap et al., 2019a). It contains three folds of data, each conditioned on the original commonsense knowledge triples (h, r, t) in ATOMIC.
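As a concrete illustration, the instance-to-concept mapping described above can be represented as a simple lookup structure. This is a hypothetical Python sketch; the strings are the running example from the text, not actual AbstractATOMIC records, and `concepts_of` is an illustrative helper, not part of our released code.

```python
# Hypothetical sketch of the event -> instance -> concept structure.
# An instance can be an entity within the event or the whole event itself.

event = "PersonX is drunk when exiting the bar"

# Each instance identified within the event maps to one or more concepts.
conceptualizations = {
    "PersonX is drunk when exiting the bar": ["drunk", "enjoyed"],  # full-event instance
    "bar": ["entertainment place", "fun place"],                    # entity-level instance
}

def concepts_of(instance):
    """Return all concepts a given instance can be abstracted to."""
    return conceptualizations.get(instance, [])
```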
In the first fold, He et al. (2022) identify all possible instances {i_1, i_2, i_3, ... | i ⊆ h} in each ATOMIC head event, using syntactic parsing with a spaCy parser and matching against five human-defined rules. It is important to note that, unlike traditional entity-level conceptualization benchmarks, an identified instance in AbstractATOMIC can also be the entire head event, i.e., i = h.
In the next fold, each identified instance i is heuristically matched against Probase (Wu et al., 2012) and WordNet (Miller, 1995) via GlossBERT (Huang et al., 2019) to find its corresponding conceptualization candidates. Human annotations are conducted to verify the plausibility of part of the conceptualization candidates. To pseudo-label unannotated conceptualizations, we use a semi-supervised conceptualization discriminator provided by Wang et al. (2023a) and set a threshold of T = 0.9 to retain plausible conceptualizations. Additionally, we utilize a GPT2-based (Radford et al., 2019) generator, trained on the concatenation of annotated and positively pseudo-labeled conceptualizations, to generate additional conceptualizations that further expand the conceptualization bank.
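The threshold-based pseudo-labeling step above can be sketched as follows, assuming the discriminator yields a scalar plausibility score per candidate. The scores, data layout, and `filter_plausible` helper are illustrative assumptions, not the actual interface of the discriminator of Wang et al. (2023a).

```python
# Minimal sketch: keep (instance, concept) candidates whose discriminator
# score reaches the plausibility threshold T = 0.9.

T = 0.9

def filter_plausible(candidates):
    """Keep (instance, concept) pairs whose discriminator score passes T."""
    return [(inst, concept) for inst, concept, score in candidates if score >= T]

# Illustrative candidates: (instance, concept, discriminator score).
candidates = [
    ("bar", "entertainment place", 0.97),
    ("bar", "animal", 0.12),
    ("PersonX is drunk when exiting the bar", "drunk", 0.93),
]
```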
However, it is worth noting that such conceptualization may not yield plausible abstract knowledge when (r, t) is connected back to h_c, where h_c is obtained by replacing i ∈ h with its conceptualizations. This is because the process of conceptualizing a head event omits its context in (r, t). Thus, the last fold of data stores the plausibility of such abstract commonsense triples (h_c, r, t), where human annotations are conducted to verify the plausibility of part of the triples. In addition, we adopt a semi-supervised instantiation discriminator, provided by Wang et al. (2023a), to pseudo-label the unannotated triples. Another threshold, T = 0.9, is set to retain plausible abstract triples.
In the CAR framework, for every ATOMIC event h, we retrieve every instance i's plausible conceptualizations {c_{i,1}, c_{i,2}, ...} from all plausible conceptualizations derived in the second fold to serve as the distractor sampling constraint. We also augment the original (h, r, t) triples with their plausible (h_c, r, t) siblings from both the human-annotated and pseudo-labeled triples, as explained in the last fold. These knowledge triples are then synthesized into QA pairs using our proposed method to train the model to perform general reasoning. Detailed statistics of the conceptualizations and abstract commonsense triples we finally obtained from the AbstractATOMIC dataset are reported in Table 5 and Table 6, respectively.
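The augmentation of (h, r, t) with its abstract (h_c, r, t) siblings can be sketched roughly as below. This is a minimal illustration under simplifying assumptions: the helper name and data are hypothetical, and the substring replacement stands in for AbstractATOMIC's span-based instance identification.

```python
# Sketch: expand one triple with its abstract siblings, where each sibling
# replaces an identified instance in the head event with one of its concepts.

def augment_with_abstractions(triple, conceptualizations):
    """Yield the original triple plus its abstract (h_c, r, t) siblings."""
    h, r, t = triple
    yield (h, r, t)
    for instance, concepts in conceptualizations.items():
        if instance in h:
            for c in concepts:
                yield (h.replace(instance, c), r, t)

# Illustrative example, mirroring Figure 3's triple.
triples = list(augment_with_abstractions(
    ("PersonX arrive at the bar", "xWant", "relax himself"),
    {"bar": ["entertainment place"]},
))
```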

B.2 Baseline Performances
For SMLM (Banerjee and Baral, 2020), we adopt the official implementation of Banerjee and Baral (2020), which employs the CSKB that exhibits the highest alignment with each task. Specifically, SocialIQA uses ATOMIC, while CommonsenseQA uses ConceptNet. For STL-Adapter (Kim et al., 2022), only the models trained on ATOMIC are used for comparison in the body text. In this paper, all baseline performances are solely based on the officially reported results in their respective papers.
As noted in the Limitations section, previous research in this area has primarily used four CSKBs, namely ATOMIC (Sap et al., 2019a), ConceptNet (Speer et al., 2017), WordNet (Miller, 1995), and WikiData (Vrandecic and Krötzsch, 2014). To comprehensively benchmark our framework's performance on zero-shot commonsense QA, we compare our results on ATOMIC against baseline methods that use multiple CSKBs, despite the unbalanced amount of knowledge in such a comparison. Table 11 presents a full comparison of our method with all existing baselines. Notably, among models based on RoBERTa-Large, our approach, trained only on abstract-knowledge-injected ATOMIC, achieves second place on the leaderboard, falling only behind Kim et al. (2022) with four CSKBs. While this comparison may be unfair due to the unbalanced amount of knowledge, it provides strong justification for the excellent performance of our system. Our DeBERTa-v3-Large-based model still surpasses all baselines on average, indicating the benefit of leveraging a strong pre-trained language model.

B.3 Benchmarking Large Language Models
We then describe our method for benchmarking large language models on the five commonsense QA benchmarks. The emergence of Large Language Models (LLMs), such as ChatGPT (OpenAI, 2022), has been a major trend in recent NLP research, and numerous studies have evaluated the capability of LLMs on various NLP downstream tasks. Among them, Qin et al. (2023) and Chan et al. (2023) have shown that ChatGPT can achieve competitive performance on commonsense reasoning tasks, such as CommonsenseQA (Talmor et al., 2019), WinoGrande (Sakaguchi et al., 2021), and Commonsense Knowledge Base Population (Fang et al., 2021b,a). In this study, we benchmark ChatGPT's zero-shot performance on the five QA evaluation benchmarks used in our zero-shot commonsense QA task. Following Robinson et al. (2022), we design and leverage a batch of prompts, as shown in Table 7, to probe ChatGPT's predictions. Each prompt provides ChatGPT with a question and its possible choices, along with a natural-language command to control ChatGPT's response format. We then parse the generations using meticulously designed rules: punctuation and irrelevant wording are dropped, and the first choice-letter prediction is identified as ChatGPT's answer. If ChatGPT hesitates and cannot make a concrete prediction, it is counted as a wrong answer. The benchmarking results are shown in Table 1. We observe that ChatGPT demonstrates superior performance compared to GPT3.5 (Ouyang et al., 2022) and excels on tasks such as CommonsenseQA (Talmor et al., 2019) and SocialIQA (Sap et al., 2019b), potentially due to the high frequency of their questions and answers in large text corpora. However, its performance on the remaining three benchmarks is suboptimal, suggesting that they are more challenging and require more complex reasoning (Bai et al., 2023; Ding et al., 2023) and implicit commonsense knowledge to solve. This intriguing outcome warrants further investigation to determine its causes and to explore methods for boosting LLMs' abilities on these challenging benchmarks.
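The choice-letter parsing rule described above can be approximated as follows. This is a simplified sketch: the actual rules in our pipeline are more elaborate, and the regular expression and function name here are illustrative assumptions.

```python
import re

def parse_choice(generation, num_choices):
    """Return the first standalone choice letter found in a generation,
    or None if the model gave no recognizable choice (counted as wrong)."""
    letters = "ABCDE"[:num_choices]
    # Look for a standalone uppercase choice letter, e.g. "A", "(B)", or "C.".
    match = re.search(r"\b([%s])\b" % letters, generation)
    return match.group(1) if match else None
```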
Generally speaking, CAR and conceptualization hold advantages over large language models in the following aspects: (1) Smaller Model Size: The CAR framework offers models that are significantly smaller than LLMs (0.2% of a standard 175-billion-parameter GPT-3 model) while maintaining comparable performance in a zero-shot setting. This size makes CAR more efficient to train and deploy. In contrast, advanced prompting techniques used with LLMs require extensive computational resources, making the conceptualization-based model more versatile and accessible to researchers with limited access to or resources for deploying LLMs. (2) Broader Commonsense Knowledge: Conceptualization provides a broader range of commonsense knowledge than current CSKBs, and integrating this type of knowledge into generative models has been shown to be beneficial.

B.5 Experiments Using ATOMIC-10X

In the first scenario, we directly train QA models on ATOMIC-10X, setting multiple critic thresholds to filter the dataset accordingly. The QA models are trained using marginal ranking loss on four subsets of ATOMIC-10X, with critic thresholds of 0.9, 0.8, 0.7, and 0.5, along with an additional model trained on the complete ATOMIC-10X dataset. Finally, we evaluate these models on the five commonsense QA benchmarks in a zero-shot setting and report the results in Table 8. Specifically, the models trained solely on ATOMIC-10X with critic thresholds of 0.7 (RoBERTa) and 0.0 (DeBERTa) are responsible for the results reported in Table 1. We observe that even with high critic thresholds for filtering the knowledge in ATOMIC-10X, the models still fail to achieve more than marginal improvements. Meanwhile, training the models only on ATOMIC-10X fails to surpass training on ATOMIC, which indicates that the amount of knowledge is not the critical factor determining performance. Rather, what matters is the diversity and quality of knowledge, where the human-annotated knowledge in ATOMIC is superior to the machine-generated knowledge in ATOMIC-10X. Nonetheless, none of these models outperform those trained on conceptualization-augmented ATOMIC using our CAR framework, which further validates the strengths of CAR.
In the second scenario, as discussed in Section 6.1, we utilize ATOMIC-10X as a means of augmentation to extend the original ATOMIC dataset. This is achieved by randomly selecting from ATOMIC-10X a number of knowledge triples equal to the total number of plausible abstract commonsense knowledge triples in AbstractATOMIC and merging them into the original dataset. The triples in the resulting ATOMIC-10X-augmented ATOMIC are then transformed into QA pairs and used to train our model following the original pipeline suggested by Ma et al. (2021). As in the previous scenario, we set four thresholds, namely 0.9, 0.8, 0.7, and 0.5, to filter the triples in ATOMIC-10X for augmentation quality control. In this way, the distractors of the QA pairs can come from both ATOMIC and ATOMIC-10X. The models are then trained and evaluated on the five benchmarks. Their zero-shot commonsense QA evaluation results are reported in Table 8, and the best model, trained with a critic threshold of 0.8 for filtering and DeBERTa-v3-large as the backbone, is responsible for the results indicated in Table 2. Interestingly, we observe that leveraging the knowledge in ATOMIC-10X, either for direct training or for augmentation, occasionally improves the model's performance on a specific benchmark. However, it fails to boost the overall performance averaged across all benchmarks, which we consider a closer metric for evaluating the generalizable reasoning ability of a commonsense QA model. Thus, we conclude that ATOMIC-10X is inconsistently helpful for zero-shot commonsense QA, failing to improve performance most of the time, while conceptualization resolves these issues and benefits the model significantly across all benchmarks. One potential reason is that ATOMIC-10X may contain noise that is not beneficial to the task of zero-shot commonsense QA, as demonstrated by Deng et al. (2023).
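The threshold-filtered, size-matched sampling described above can be sketched as follows, under assumed data shapes: each ATOMIC-10X triple is paired with a critic score, and `sample_augmentation` is a hypothetical helper, not part of our released code.

```python
import random

def sample_augmentation(atomic10x, threshold, budget, seed=0):
    """Filter (triple, critic_score) pairs by the critic threshold,
    then randomly sample up to `budget` triples for augmentation."""
    pool = [t for t, score in atomic10x if score >= threshold]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(pool, min(budget, len(pool)))

# Illustrative example with three scored triples and a budget of two.
example = sample_augmentation(
    [(("h1", "r", "t1"), 0.95), (("h2", "r", "t2"), 0.40), (("h3", "r", "t3"), 0.85)],
    threshold=0.8, budget=2)
```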

B.6 Training Dynamic Definitions
Training dynamics, as proposed by Swayamdipta et al. (2020), refer to the analysis of a model's behavior on individual instances during training on large datasets. This analysis examines the model's confidence in predicting the true class of an instance and the variability of this confidence across epochs. To achieve this, multiple checkpoints are saved throughout a training epoch, and probability scores are derived for each data instance to calculate its training dynamics. By plotting the training dynamics of all instances on a data map, instances can be categorized into three groups: easy-to-learn, ambiguous, and hard-to-learn. For instance, consider a QA pair for which a model consistently assigns a higher logit score to the correct answer than to the distractors across multiple checkpoints during an epoch. In this scenario, the model exhibits high confidence and low variability for that specific instance, suggesting that it is easy to learn. Conversely, instances with higher variability are ambiguous to the model, and those with low confidence are hard to learn. Experimental results by Swayamdipta et al. (2020) demonstrate that training the model with ambiguous data contributes the most to out-of-distribution generalization.
Inspired by this finding, we investigate the role of abstract commonsense knowledge within the training set and the effects of leveraging conceptualization. Since our QA model is trained with a marginal ranking loss, as described in Section 4.3, it does not output a probability score but rather an MLM score for each option. Thus, the definition of model confidence proposed by Swayamdipta et al. (2020) does not fit our problem setting. To address this, we re-define confidence to reflect the model's degree of certainty in predicting an instance as the true class. Formally, denote n as the number of checkpoints saved during an epoch for computing training dynamics, and the list of m options in a (Q_i, A_i) pair as A_i = {A_{i,1}, A_{i,2}, ..., A_{i,m}}, with A_{i,j} being the ground-truth answer (1 ≤ j ≤ m). We define the confidence of the model for such a QA pair in Equation 3, where σ is the sigmoid function and S^c_{i,d} is the score of option A_{i,d} at checkpoint c.
Intuitively, this equation averages the gap between the ground-truth answer's score and the score of each distractor. A larger gap indicates a more confident model when choosing the answer. Variability aligns with the definition established by Swayamdipta et al. (2020). Specifically, it is calculated as the standard deviation of the score gap between the ground-truth answer and the distractors relative to the level of confidence exhibited throughout an entire epoch, as shown in Equation 4.
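A plausible implementation of these two quantities is sketched below. It mirrors the verbal description (the sigmoid of the score gap between the answer and each distractor, averaged across checkpoints, and its standard deviation across checkpoints), but it is an assumption-laden sketch rather than the exact form of Equations 3 and 4.

```python
import math
import statistics

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def per_checkpoint_gap(scores_c, j):
    """Mean sigmoid gap between the answer's score and each distractor's,
    where scores_c[d] is the score of option d at one checkpoint and j is
    the index of the ground-truth answer."""
    gaps = [sigmoid(scores_c[j] - s) for d, s in enumerate(scores_c) if d != j]
    return sum(gaps) / len(gaps)

def confidence(scores, j):
    """Average the per-checkpoint gap over all n saved checkpoints."""
    return sum(per_checkpoint_gap(c, j) for c in scores) / len(scores)

def variability(scores, j):
    """Standard deviation of the per-checkpoint gap across the epoch."""
    return statistics.pstdev([per_checkpoint_gap(c, j) for c in scores])
```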
Revisiting the plots in Figure 4, we observe that the inclusion of abstract commonsense knowledge enhances the model's confidence and reduces variability when encountering knowledge in ATOMIC. The introduction of conceptualization appears to widen the gap between the model's predicted scores for the correct answer and those for the distractors. This suggests that the correct answer is more likely to be selected, leading to an improved learning outcome. However, the introduction of knowledge from ATOMIC-10X shows a reversed trend, indicating that it does not aid in better learning ATOMIC. Furthermore, we observe that abstract knowledge derived from conceptualizations is more ambiguous to the model in the conceptualization-augmented ATOMIC, which theoretically contributes more to out-of-domain generalization. Nonetheless, ATOMIC-10X still contains some easy-to-learn knowledge, which does not contribute much to such generalization.

We also plot the changes in training dynamics on different QA benchmarks, comparing models with and without the injection of abstract knowledge. The plots are shown in Figure 7. We observe that the inclusion of abstract commonsense knowledge significantly improves the models' confidence on downstream QA entries. However, the impact on the trend of variability is unclear. Nevertheless, this improvement in average confidence provides strong evidence for the model's enhancement on these downstream QA benchmarks.

B.7 Ablation Study
Next, we study the ablation of different components in our CAR framework to determine the impact of utilizing conceptualization through various techniques. Two critical components distinguish CAR from traditional zero-shot QA systems (Ma et al., 2021):
• Conceptualization Augmentation: Augmenting the original commonsense knowledge in a CSKB with its conceptualizations to derive abstract commonsense knowledge. This knowledge is then synthesized into QA pairs, enabling the model to reason from a more generalized perspective. Without this component, abstract commonsense knowledge is not incorporated into the CSKB. Conceptualizations still remain as constraints for assisting QA pair synthesis, resulting in an approach similar to applying our proposed QA synthesis protocol directly to ATOMIC.
• Concept-Constrained QA Synthesis: Constraining a question's distractors by ensuring that none of their head events share a common keyword or conceptualization with the question's keywords and conceptualizations. If this component is dropped, the constraint is downgraded so that only the sharing of common keywords between the question and its distractors is prohibited. This variant introduces abstract commonsense knowledge into the CSKB but uses the original distractor generation strategy for synthesizing QA pairs.
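The constraint in Concept-Constrained QA Synthesis can be expressed as a simple disjointness check. This is a minimal sketch with hypothetical inputs; in the real pipeline, the keyword and concept sets are derived from the triples as described in Appendix B.1.

```python
def is_valid_distractor(question_keywords, question_concepts,
                        cand_keywords, cand_concepts):
    """True iff the candidate distractor's head event shares neither
    keywords nor conceptualizations with the question's head event."""
    return (question_keywords.isdisjoint(cand_keywords)
            and question_concepts.isdisjoint(cand_concepts))
```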
We then train two batches of QA models, using RoBERTa-Large and DeBERTa-v3-Large as backbones, by dropping the two components above one at a time. Their zero-shot performances on the five commonsense QA benchmarks are reported in Table 9. From the results, we observe that both components play important roles in CAR, with Concept-Constrained QA Synthesis being more effective on average. This underscores the significance of eliminating false negative distractors, and conceptualization proves to be a useful tool for achieving this objective and improving the QA model's overall performance.

B.8 The Effect of Conceptualization
Lastly, we study the improvement in the generalizability of our framework with the aid of conceptualization by examining the accuracy gains on questions with varying levels of semantic overlap with the knowledge in ATOMIC's training split. To do so, we sort the questions in every benchmark by the average BERTScore (Zhang et al., 2020) between each individual question entry and the whole training set of the original ATOMIC. We then split the questions into two sets based on their BERTScores: a lower BERTScore indicates lower semantic overlap and a greater need for the model to generalize in order to answer the question. These questions are denoted as "Difficult." Conversely, we refer to questions with high BERTScores as "Easy." We then train two QA models following the pipeline proposed by Ma et al. (2021), one trained on conceptualization-augmented ATOMIC and the other on ATOMIC only. We evaluate their performance on the five commonsense QA benchmarks and compare the performance gains between the two sets of questions in each benchmark, as shown in Figure 6. The results demonstrate that incorporating conceptualization positively impacts accuracy, particularly for questions that deviate significantly from ATOMIC, across multiple benchmarks. This indicates that augmenting ATOMIC with conceptualizations can improve the model's generalizability, particularly for questions that tend to be out-of-distribution and require more relevant knowledge to answer correctly.
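The Easy/Difficult split can be sketched as follows, assuming the per-question average BERTScore against ATOMIC's training split has already been computed (computing BERTScore itself is omitted). The median split shown here is one plausible way to implement the two-set division described above; `split_by_overlap` is an illustrative helper.

```python
import statistics

def split_by_overlap(question_scores):
    """Split (question, avg_bertscore) pairs at the median score:
    below-median questions are "Difficult", the rest are "Easy"."""
    median = statistics.median([s for _, s in question_scores])
    difficult = [q for q, s in question_scores if s < median]
    easy = [q for q, s in question_scores if s >= median]
    return easy, difficult

# Illustrative example with four questions and made-up scores.
easy, difficult = split_by_overlap(
    [("q1", 0.90), ("q2", 0.50), ("q3", 0.70), ("q4", 0.95)])
```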

B.9 Generalization to Other CSKBs
While our work primarily experiments with the AbstractATOMIC dataset as the conceptualization source for ATOMIC, we also aim to extend our framework to other CSKBs for a more generalizable evaluation. To this end, we follow Ma et al. (2021) and explore the feasibility of transferring our framework to the CWWV dataset, which comprises multiple CSKBs, including ConceptNet (Speer et al., 2017), WordNet (Miller, 1995), and WikiData (Vrandecic and Krötzsch, 2014). To accomplish this, we train a conceptualization generator based on GPT2 (Radford et al., 2019) and utilize ChatGPT (OpenAI, 2022) as two flexible generative conceptualizers. The generated conceptualizations are then transformed into abstract knowledge and integrated into the CWWV dataset. This augmented dataset is used to train a zero-shot commonsense QA reasoner with our proposed CAR framework. We present the experimental results and compare them with baselines in Table 10. We observe a modest improvement of 1% in average accuracy over all baselines and performance comparable to GPT3.5. These results demonstrate the effectiveness of incorporating conceptualizations from other CSKBs. In future research, we suggest exploring automatic construction methods for conceptualization resources in other CSKBs and investigating their potential benefits for general commonsense reasoning.

C Case Study
In this section, we present case studies to demonstrate the effectiveness of CAR. First, we discuss cases that illustrate the power of conceptualization augmentation, as shown in Table 12. By transforming triples into abstract commonsense knowledge, we can introduce more general knowledge into the CSKB and improve its coverage. Moreover, the newly introduced triples were missing from the original CSKB. For instance, conceptualizing "PersonX plays the games together" as an "entertainment activity" introduces higher-level knowledge that cannot be simply represented by the original triple. Additionally, by synthesizing both types of triples into QA pairs, the QA model can learn both types of knowledge, which helps it perform more generalizable reasoning on out-of-distribution commonsense QA benchmarks. Next, in Table 13, we present QA pairs containing false negative options generated using keyword constraints during their synthesis from both the original ATOMIC and ATOMIC-10X, and we demonstrate how our concept constraint resolves this issue. Through these case studies, we observe that the original distractors may contain one or even two plausible options, which is suboptimal for training a QA model. Specifically, for distractors sampled from ATOMIC-10X, we observe that several distractors are vague and general (denoted as "?") and can be plausible in many contexts. For example, across various triples, adjectives like "happy" and verb phrases such as "do it" are easily plausible and do not serve as meaningful distractions. This is undesirable when training a QA model. However, by using conceptualizations as a constraint, the newly sampled distractors are all strong negatives, allowing the model to learn from such negative commonsense knowledge. This is because the distractors are sourced from triples that are more likely to be irrelevant to the original triple's context and, thus, more likely to be truly negative distractors.

Figure 1 :
Figure 1: An example of constructing synthetic QA pairs from a CSKB (Ma et al., 2021). The simple heuristic used in this process can result in false negative options.
PersonX arrive at the bar, what does PersonX want to do? A: relax himself B: keep fit C: cry

Figure 3 :
Figure 3: An overview of the CAR framework, showing the process of synthesizing (PersonX arrive at the bar, xWant, relax himself) into QA pairs. The triple is conceptualized first, and potential distractor triples are sampled and filtered by keyword and concept overlap. Only triples with no overlap are used as distractors.

Figure 4 :
Figure 4: Analyses of the training dynamics of different knowledge. The dotted lines refer to the median values.

Figure 5 :
Figure 5: Average accuracy achieved by models trained on our training set downsampled to several ratios.

Figure 6 :
Figure 6: Comparison of accuracy improvement (%) with/without conceptualization augmentation for two groups of QA entries across five benchmarks. Avg. stands for averaging across all benchmarks.

Table 2 :
Comparison results (%) of different augmentation methods against conceptualization. N/A stands for not using any augmentation. Plau. is the expert-evaluated ratio of plausible augmented knowledge; %F.Neg. represents the expert-annotated proportion of false negative options. Div. and Exp.Div. are diversities measured by embedding similarity and expert-annotated knowledge coverage, respectively. Performances on the right refer to accuracies achieved by the QA model trained on data augmented by each method. The best performances are bold-faced.

Table 3 :
Experiments on the generalizability of CAR on other CSKBs (CWWV).

Table 4 :
Statistics on the number of QA pairs and the number of options for each question within each benchmark's validation split.

Table 5 :
Statistics of conceptualizations used in CAR, as reported by Wang et al. (2023a). D_h^l stands for human-annotated conceptualizations and D_h^u for unlabeled conceptualizations. Unq stands for unique, and Avg refers to average.

Table 10 :
Zero-shot evaluation results (%) on five commonsense question answering benchmarks by models trained on the CWWV dataset. CWWV_C refers to the augmented CWWV dataset using generated conceptualizations from a trained GPT2 generator and ChatGPT.