Retrieval Augmentation for Commonsense Reasoning: A Unified Approach

A common thread of retrieval-augmented methods in the existing literature focuses on retrieving encyclopedic knowledge, such as Wikipedia, which lends itself to a well-defined space of entities and relations that can be modeled. However, applying such methods to commonsense reasoning tasks faces two unique challenges: the lack of a general large-scale corpus for retrieval, and the lack of a corresponding effective commonsense retriever. In this paper, we systematically investigate how to leverage commonsense knowledge retrieval to improve commonsense reasoning tasks. We propose a unified framework of retrieval-augmented commonsense reasoning (called RACo), including a newly constructed commonsense corpus with over 20 million documents and novel strategies for training a commonsense retriever. We conduct experiments on four different commonsense reasoning tasks. Extensive evaluation results show that our proposed RACo significantly outperforms other knowledge-enhanced counterparts, achieving new SoTA performance on the CommonGen and CREAK leaderboards.


Introduction
Recent work has shown that scaling language models with considerably more data and parameters, such as GPT3-175B (Brown et al., 2020), could drive significant advances in commonsense reasoning tasks. Nevertheless, such models make predictions by only "looking up information" stored in their parameters, making it difficult to determine what knowledge is stored or has already been forgotten by the neural network (Guu et al., 2020). Besides, storage space is limited by the size of the neural network. In order to memorize more world knowledge, one must train ever-larger networks, which can be prohibitively expensive and slow.
The solution that may seem obvious at first glance is to grant language models free access to open-world sources of commonsense knowledge in a plug-and-play manner, instead of memorizing all world knowledge. To achieve this capability, language models must be able to retrieve relevant commonsense knowledge from an unbounded set of situations. Then, the language models can leverage the input text, as well as the retrieved information, to produce the desired output.
Compared with large-scale language model counterparts, e.g., UNICORN (Lourie et al., 2021), retrieval-augmented methods have three remarkable advantages: first, the knowledge is not stored implicitly in the model parameters, but is explicitly acquired in a plug-and-play manner, leading to great scalability; second, the paradigm generates text based on retrieved references, which alleviates the difficulty of generating from scratch (Li et al., 2022); third, the knowledge corpus can be constantly edited and updated by experts, making the model aware of the latest information. In addition, compared with knowledge graph inference model counterparts, e.g., QA-GNN (Yasunaga et al., 2021), retrieval-augmented methods allow more flexibility in accessing and using knowledge from different sources, because of the nature of commonsense knowledge, which cannot all be contained in a single knowledge graph defined by a certain schema (Yu et al., 2022b).
A common thread of retrieval-augmented methods in the existing literature focuses on retrieving encyclopedic knowledge such as Wikipedia, which lends itself to a well-defined space of entities and relations that can be modeled (Karpukhin et al., 2020; Lewis et al., 2020b; Yu et al., 2022a). However, retrieval-augmented methods for commonsense reasoning have been rarely studied in the literature. In this paper, we propose a unified framework of Retrieval-Augmented Commonsense reasoning (RACO) to solve various commonsense tasks. RACO first retrieves relevant commonsense documents from a large-scale corpus, then combines the input text with the retrieved documents to produce the desired output. However, there are two main challenges in training a RACO model. The first challenge is what commonsense knowledge to retrieve. Different from the encyclopedic knowledge used in open-domain QA tasks, commonsense knowledge is very diverse, covering everyday events and their effects, facts about beliefs and desires, and properties of objects in daily life. Since commonsense involves various aspects including human interaction and object properties in everyday life, we collected over 20 million commonsense documents from both open-domain knowledge sources (e.g., OMCS) that cover multiple domains of commonsense, and domain-specific sources (e.g., ATOMIC) that focus on particular commonsense types.
The second challenge is how to retrieve relevant commonsense knowledge from the corpus. Different from training a dense retriever on Wikipedia (Karpukhin et al., 2020), the heuristic of taking "documents containing correct answers" as positive candidates cannot be used, because the output answer in commonsense reasoning tasks is usually not a substring of the retrieved documents. For example, in binary question answering, the answer is True or False, which does not appear in the retrieved documents. Therefore, we propose novel strategies to construct question-document pairs for training a commonsense dense retriever.
Overall, our main contributions in this work can be summarized as follows: 1. We collected and released a collection of over 20 million documents from three knowledge sources for commonsense knowledge retrieval.
2. We presented a unified framework of Retrieval-Augmented Commonsense reasoning (RACO), and proposed novel strategies for training a strong commonsense knowledge retriever.
3. We evaluated our RACO on four types of commonsense reasoning tasks. Our experiments showed RACO can significantly outperform other knowledge-enhanced counterparts, achieving new SoTA on the CommonGen and CREAK leaderboards.

Related Work
Though large-scale language models yield state-of-the-art performance on many commonsense reasoning tasks, their pre-training objectives do not explicitly guide the models to reason with commonsense knowledge such as the relations and composition of daily concepts in our lives (Zhou et al., 2021), leading to unsatisfactory performance in many real-world scenarios (Talmor et al., 2021; Zhu et al., 2022). Existing work has mainly explored two directions to improve their commonsense reasoning ability. The first is to pre-train or post-train a language model on commonsense corpora (Bosselut et al., 2019; Lourie et al., 2021; Zhou et al., 2021). When the commonsense materials are appropriately selected, this simple strategy can yield performance significantly superior to vanilla pre-trained language models (Zhou et al., 2021). Notable methods include COMET (Bosselut et al., 2019), CALM (Zhou et al., 2021), UNICORN (Lourie et al., 2021), etc. Nonetheless, these methods still suffer from the same drawbacks as the pre-trained language models introduced in §1. The second is to explicitly introduce external knowledge from commonsense knowledge graphs to augment the limited textual information (Lin et al., 2019; Ji et al., 2020). A KG often provides comprehensive and rich entity features and relations, so models can easily traverse links to discover how entities are interconnected to express certain commonsense knowledge. Notable methods include KagNet (Lin et al., 2019), GRF (Ji et al., 2020), QA-GNN (Yasunaga et al., 2021), GreaseLM (Zhang et al., 2022), etc. However, commonsense knowledge spans an unbounded set of facts and situations that usually cannot be covered by a single knowledge graph defined by a certain schema. Reasoning over multiple knowledge graphs is a challenging task.
The retrieval-augmented method is a learning paradigm that fuses pre-trained language models and traditional information retrieval techniques (Lewis et al., 2020b). A few recent methods have explored retrieving in-domain commonsense documents from a task-relevant corpus to improve commonsense reasoning performance (Mihaylov et al., 2018; Wang et al., 2021; Li et al., 2021). We provide a detailed comparison in Table 1. Different from existing methods that focus on retrieving knowledge from an in-domain corpus, our proposed RACO leverages a much larger and more general commonsense corpus collected from multiple sources, which provides supporting evidence for various commonsense reasoning tasks. Meanwhile, we propose several novel strategies for training a commonsense retriever that can generalize to different commonsense reasoning tasks.

Proposed Method
In this section, we elaborate on how to leverage commonsense knowledge retrieval from a large-scale corpus to improve various commonsense reasoning tasks, including commonsense corpus construction (§3.1), the commonsense document retriever (§3.2), and the commonsense document reader (§3.3). The architecture of RACO is shown in Figure 1.

Commonsense Corpus Construction
Commonsense knowledge includes the basic facts about situations in everyday life, which are shared by most people and implicitly assumed in communication (Li et al., 2022). Commonsense knowledge has two important properties: it is large in scale and diverse.
Regarding the scale of knowledge, many commonsense corpora contain millions of statements. For example, Wiktionary has more than one million word definitions and descriptions in English. Meanwhile, commonsense knowledge is diverse, involving various aspects including human interaction and object properties. For example, OMCS covers multiple domains of commonsense such as everyday events and their effects (e.g., mop up the floor if we spill food on it), facts about beliefs and desires (e.g., study hard to win a scholarship), and properties of objects (e.g., a goat has four legs). The diversity of knowledge is beneficial for retrieval-augmented methods because it enables relevance comparison across different sources, and it offers textual knowledge that can easily augment the input of generation models by concatenation. To build a large-scale commonsense corpus covering diverse sources, we collected commonsense documents of the following three kinds: (i) human-annotated facts; (ii) commonsense benchmark datasets; (iii) commonsense-relevant web corpora. The statistics can be found in Table 2.
Commonsense relevant corpus (CRC). It consists of raw statements about commonsense from the web, usually after some simple filtering. We obtained a filtered version from the AI2 commonsense corpus, which is a merged corpus collected from ARC (Clark et al., 2018), QASC (Khot et al., 2020) and GenericsKB (Bhakthavatsalam et al., 2020).

Commonsense Document Retrieval
Given a collection of M commonsense documents, the goal of our retriever is to map each document to a low-dimensional vector, such that it can efficiently retrieve the top-k documents relevant to the input text. Note that M can be very large (e.g., over 20 million in our experiments) while k is usually small (e.g., 10 or 20 in our experiments).
In this work, we follow the neural document retriever DPR (Karpukhin et al., 2020) and employ two independent BERT (Devlin et al., 2019) models to encode the query and the document separately, estimating their relevance by computing a single similarity score between their [CLS] token representations. Specifically, the document encoder E_D(·) maps any text document to a low-dimensional real-valued vector and builds an index over all M documents used for retrieval. At runtime, a different query encoder E_Q(·) maps the input question to a vector of the same dimension as the document vectors, and the retriever fetches the top-k documents whose vectors are closest to the question vector. The similarity between question q and document d is calculated by the dot product of their vectors:

sim(q, d) = E_Q(q) · E_D(d).    (1)

Recent efforts have shown that DPR transfers poorly to other domains (Li and Lin, 2021; Kulshreshtha et al., 2021). Thus, the primary challenge of training a strong commonsense retriever is to appropriately construct positive pairs and hard negative pairs (Karpukhin et al., 2020; Xiong et al., 2021). To do this, we propose novel strategies to construct question-document pairs that can be used for training a strong commonsense retriever.
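The dot-product scoring and top-k lookup described above can be sketched as follows. This is a minimal numpy illustration: the toy 4-dimensional vectors stand in for BERT [CLS] embeddings, and a real system over 20+ million documents would use an approximate-nearest-neighbor index (e.g., FAISS) rather than a brute-force matrix product.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_index, k=10):
    """Score a query vector against a pre-built document index by
    dot product (as in DPR) and return the indices and scores of the
    top-k documents, highest score first."""
    scores = doc_index @ query_vec          # (M,) similarity scores
    top_k = np.argsort(-scores)[:k]         # highest dot product first
    return top_k, scores[top_k]

# Toy index: 5 "documents" embedded in a 4-dimensional space.
doc_index = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.9],
])
query_vec = np.array([1.0, 0.0, 0.0, 0.0])  # stand-in for E_Q(question)

idx, scores = retrieve_top_k(query_vec, doc_index, k=2)
```

The document side is computed once offline; only the query encoding and the scoring happen at runtime.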

Positive Training Pairs
In open-domain document retrieval, positive training pairs are often available explicitly. For example, DPR treated Wikipedia documents that contain the correct answer as positive documents (Karpukhin et al., 2020). However, such training pairs might not be applicable to commonsense reasoning tasks, because the output (e.g., True / False in binary question answering) is not supposed to be a substring of the retrieved documents.
In order to train a strong commonsense dense retriever, we propose two novel strategies to construct positive training pairs, as described below.
Explanation as positive document. The first method for constructing positive training pairs is to take human-annotated explanations as positive documents. For example, taking the question "Where do people go to pray? (A) church" from CommonsenseQA1.0 as input, the explanation annotated in Aggarwal et al. (2021) is "People go to a church to pray"; similarly, a positive document for the question "When food is reduced in the stomach, nutrients are being deconstructed" in OpenBookQA (Mihaylov et al., 2018) could be "Digestion is when stomach acid breaks down food". The explanations have two important properties. First, they contain commonsense knowledge, such as people praying in church, in the form of natural language. Second, they can be used to help select the correct choice in commonsense reasoning tasks. So, we take advantage of the high correlation of natural language explanations with the input query, defining the input query as q and the corresponding annotated explanation as d to train the retriever.
Ground truth output as positive document. The second method for constructing positive training pairs is to directly use the ground truth outputs of generation tasks as positive documents. The ground truth output can be seen as a natural positive document that the retriever should retrieve. For example, in the CommonGen (Lin et al., 2020) dataset, the ground truth output for the input concept set {dance, kid, room} is "a group of kids are dancing around a living room". We define the input sequence of a generation task as q and its corresponding ground truth output as d to train the retriever. During training, the vector distance between them is minimized. During inference, though the ground truth documents are no longer in the commonsense corpus, the retriever can still fetch relevant documents similar to the ground truth output, such as "a couple of kids are dancing on the floor" (ARC corpus), which provides relevant context describing the potential interaction between the input concepts "kid" and "dance" and hence helps generate the desired output.
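A minimal sketch of how the two strategies above turn raw examples into (q, d) training pairs. The field names (question, explanation, concepts, reference) are illustrative stand-ins, not the paper's actual data schema:

```python
def build_positive_pairs(qa_examples, gen_examples):
    """Build (query, positive document) pairs using the two strategies
    described above: (1) a human-annotated explanation serves as the
    positive document for a QA-style question; (2) the ground-truth
    output serves as the positive document for a generation input."""
    pairs = []
    for ex in qa_examples:                        # explanation as positive
        pairs.append((ex["question"], ex["explanation"]))
    for ex in gen_examples:                       # ground truth as positive
        pairs.append((" ".join(ex["concepts"]), ex["reference"]))
    return pairs

# Toy examples mirroring the cases discussed in the text.
qa = [{"question": "Where do people go to pray? (A) church",
       "explanation": "People go to a church to pray"}]
gen = [{"concepts": ["dance", "kid", "room"],
        "reference": "a group of kids are dancing around a living room"}]
pairs = build_positive_pairs(qa, gen)
```

Each resulting pair is then fed to the dual-encoder training described in §3.2.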

Negative Training Pairs
For negative pairs, we adopt the trick of in-batch negatives, which has been shown to be an effective strategy for learning a dual-encoder model and is used in many recent dense retrieval models (Lee et al., 2019; Karpukhin et al., 2020).
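With in-batch negatives, the other B-1 positive documents in a batch of size B act as negatives for each question. A numpy sketch of the resulting contrastive objective (the actual training uses BERT encoders and backpropagation; here the embeddings are toy vectors):

```python
import numpy as np

def in_batch_negative_loss(q_vecs, d_vecs):
    """Contrastive loss with in-batch negatives: for a batch of B
    (question, positive document) pairs, the positive for question i is
    document i, and the other B-1 documents in the batch act as
    negatives. The loss is the mean negative log-softmax of the
    diagonal of the B x B similarity matrix."""
    scores = q_vecs @ d_vecs.T                           # (B, B) dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# When every question embedding strongly matches its own document
# and no other, the loss approaches zero.
loss = in_batch_negative_loss(10 * np.eye(4), 10 * np.eye(4))
```

This reuses each batch's documents as negatives for free, avoiding a separate negative-mining pass.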

Commonsense Document Reader
After retrieving commonsense documents, the reader takes the input text along with the retrieved documents to produce the desired output. Sequence classification tasks are treated as generation with a target sequence of length one. In our work, we use the fusion-in-decoder (FiD) (Izacard and Grave, 2021) model as the reader. Specifically, each retrieved document is concatenated with the input text, then independently encoded by the T5 (Raffel et al., 2020) encoder. Then, the T5 decoder performs cross-attention over the concatenation of the resulting representations of all the retrieved documents.
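The FiD data flow can be sketched as follows. The encoder here is a toy stand-in for T5, and the "question: ... context: ..." prompt format is illustrative; the key point is that each document is encoded independently while the decoder sees all per-document representations jointly.

```python
def fid_encode(question, documents, encoder):
    """Fusion-in-Decoder input processing: each retrieved document is
    concatenated with the question and encoded independently; the
    per-document representations are then concatenated so the decoder
    can cross-attend over all documents at once."""
    inputs = [f"question: {question} context: {doc}" for doc in documents]
    per_doc = [encoder(text) for text in inputs]      # one encoder pass each
    return [state for states in per_doc for state in states]

# A toy "encoder" that returns one pseudo-state per token.
toy_encoder = lambda text: text.split()
docs = ["kids dance in a room", "a child dances"]
fused = fid_encode("what do kids do", docs, toy_encoder)
```

Because each (question, document) pair is encoded separately, encoder cost grows linearly in the number of documents rather than quadratically in the total context length.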

Tasks and Datasets
Multi-choice question answering. Given a question, an intelligent system is asked to select the one correct answer from a list of choices. We conducted experiments on CommonsenseQA1.0 (Talmor et al., 2019) and OpenBookQA (Mihaylov et al., 2018). CommonsenseQA1.0 (CSQA1.0) contains 12,102 questions with one correct answer and four distractor answers. OpenBookQA (OBQA) consists of 5,957 elementary-level questions with one correct answer and three distractor answers. For evaluation, the primary metric on these two tasks is accuracy (ACC.).
Commonsense fact verification. Given a commonsense claim, an intelligent system is expected to verify the natural-language statement against facts. For example, the statement "A pound of cotton has the same weight as a pound of steel" in CommonsenseQA2.0 (Talmor et al., 2021) should be identified as true. We conducted experiments on two commonsense fact verification datasets: CommonsenseQA2.0 (Talmor et al., 2021) and CREAK (Onoe et al., 2021). CommonsenseQA2.0 was collected via gamification and includes 14,343 assertions about everyday commonsense knowledge. CREAK is designed for commonsense reasoning about entity knowledge and consists of 13,000 assertions about entities. For evaluation, the primary metric is accuracy (ACC.).
Constrained commonsense generation. Given a set of concepts such as "dog, frisbee, catch, throw", the task is to generate a coherent sentence describing an everyday scenario, such as "a man throws a frisbee and his dog catches it". Our experiments were conducted on the benchmark dataset CommonGen (Lin et al., 2020). It consists of 79,000 commonsense descriptions over 35,000 unique concept sets. The average input / output length is 3.4 / 10.5 words. All examples in the dataset have 4-6 references. The task is evaluated by SPICE (Anderson et al., 2016), BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015), in which SPICE is the primary metric for leaderboard ranking.
Counterfactual explanation generation. Given a counterfactual statement, the task is to generate reasons why the statement does not make sense. Our experiments were conducted on the benchmark dataset ComVE from SemEval-2020 Task 4 (Wang et al., 2020). It contains 11,997 examples. The average input / output length is 7.7 / 9.0 words. Each ground truth has 3 references. The task is evaluated by SPICE (Anderson et al., 2016), BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and CIDEr (Vedantam et al., 2015), in which BLEU-4 is the primary metric for leaderboard ranking.

Baseline Methods
We compared our RACO with various kinds of baseline methods. In addition to comparing with pre-trained language models, such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020), we also compared with three kinds of commonsense knowledge augmented methods, introduced below.
Commonsense-aware language models (CLM). They are trained with external commonsense corpora or datasets, in order to embed commonsense knowledge into their parameters. During fine-tuning, the language models make predictions without access to any external corpus. In the experiments, we compared our model with CALM (Zhou et al., 2021) and UNICORN (Lourie et al., 2021).
Knowledge graph reasoning models (KGM). KGs are incorporated into models to augment the limited information in the input texts. We compared our model with KagNet (Lin et al., 2019), GRF (Ji et al., 2020), KG-BART (Liu et al., 2021), QA-GNN (Yasunaga et al., 2021), MoKGE (Yu et al., 2022c) and GreaseLM (Zhang et al., 2022).
Retrieval augmented models (RAM). We compared with a recent retrieval-augmented method named KFCNet (Li et al., 2021) for constrained commonsense generation. In addition, we also compared with using a sparse retriever such as BM25 to retrieve knowledge from our constructed commonsense corpus and using FiD (Izacard and Grave, 2021) as the generator to produce outputs.
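For reference, the BM25 sparse-retrieval baseline scores documents by weighted term overlap. Below is a minimal, self-contained scorer with standard k1 and b defaults; real experiments would use an inverted index over the full 20M-document corpus, and the documents here are toy examples:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25, standing in for the sparse-retrieval baseline:
    returns one relevance score per document in `docs`."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()
    for d in tokenized:
        df.update(set(d))                      # document frequency per term
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["a group of kids are dancing around a living room",
        "the sun can dry wet clothes",
        "a goat has four legs"]
scores = bm25_scores("kids dancing room", docs)
```

Unlike the dense retriever of §3.2, BM25 requires no training, which is why it serves as the natural sparse baseline.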

RACO vs. Baseline Methods
Comparison with non-retrieval methods.To observe the effectiveness of retrieval on commonsense reasoning tasks, we first compared model performance with and without commonsense retrieval.
As shown in Tables 3-5, commonsense retrieval improved reasoning performance by a large margin. Specifically, RACO improved BLEU-4 by +8.44% on the commonsense generation tasks, improved accuracy by +5.43% on the multiple-choice question answering tasks, and improved accuracy by +6.15% on the commonsense verification tasks. Therefore, we conclude that RACO can leverage the retrieval of relevant references from commonsense corpora to help language models produce better outputs in various commonsense reasoning tasks.
Comparison with other knowledge-enhanced methods. As mentioned in §4.2, the commonsense reasoning ability of a language model can be enhanced by fine-tuning on commonsense corpora or by reasoning over multi-hop relations on knowledge graphs. As indicated by

Effects on Commonsense Retriever
To evaluate the effectiveness of commonsense retrieval, we compare the performance of different retriever settings, including BM25, DPR_Wiki, and DPR_RACo. Specifically, DPR_Wiki directly uses the DPR trained on Wikipedia for commonsense retrieval without any fine-tuning. DPR_RACo trains the commonsense dense retriever using our proposed training pair construction strategies.
As shown in Table 6, we observe that DPR_Wiki performs the worst among all retrievers. Our proposed DPR_RACo significantly improves retrieval performance compared to BM25. It is important to note that retrieval performance is not necessarily linearly related to final output performance. However, in general, the more relevant the retrieved content, the more helpful it is for producing better outputs during the reading step. The same observation can be drawn from the comparison of BM25+FiD and RACO in Tables 3-5.

Effects on Multi-dataset Training
As shown in Table 7, we compare the performance of retrievers trained on different sets of question-document pairs. For example, the first row represents a retriever trained with only the question-document pairs (5,000 in total) from the OBQA dataset. The last row represents using question-document pairs from all six datasets.
From the table, we observe that when the retriever is trained on only one dataset, it might not work well on other datasets because of differences in data distribution. Instead, training with multiple datasets demonstrates better generalizability.

Effects on Commonsense Corpus
To validate the effect of the number and content of corpora on our proposed method, we test model performance under different corpora, including each single corpus and combinations of multiple corpora. In Table 8, we show the performance on CSQA2.0 and CREAK with different commonsense corpora. It is worth noting that, compared with the other datasets, CSQA2.0 and CREAK can more realistically reflect the impact of different corpora on model performance, mainly because these two datasets were not based on any particular commonsense knowledge source during the collection process, so the coverage of their questions is much wider than that of the other four datasets, which were collected from a certain knowledge source. For example, CSQA1.0 and CommonGen were collected based on ConceptNet.

Effects on Number of Documents
We also compared model performance with different numbers of retrieved documents. As shown in Figure 2, as the number of retrieved documents increases, the performance of RACO on the CommonGen dataset first increases and then remains unchanged on BLEU-4, or even decreases on SPICE (the primary metric on the CommonGen leaderboard), while the GPU memory consumption increases significantly. This is mainly because when the number of retrieved documents increases, more noisy information might be included, which could hurt model performance. Thus, to keep a reasonable computational overhead, we only use 10 documents in our experiments.

Human Evaluation
We randomly sample 50 generated outputs from the CommonGen dev set (as the test set is not public) and 50 generated outputs from the ComVE test set. All evaluations were conducted on Amazon Mechanical Turk (AMT), and each evaluation form was answered by three AMT workers. The generated outputs are evaluated for fluency and accuracy. Fluency assesses the grammatical correctness and readability of the generated outputs, disregarding the input text. Accuracy evaluates whether the generated output is correct and reasonable given the input text of each task.
As shown in Table 9, our model significantly outperforms baseline methods in terms of accuracy and fluency on both datasets. In particular, the accuracy of the generated output is greatly improved due to the incorporation of the retrieved commonsense knowledge. Furthermore, since all baseline models are pre-trained on large-scale corpora, they all produce outputs with high fluency. Nonetheless, compared with baseline methods, the outputs generated by our model on the CommonGen dataset still have better fluency. This is mainly because the retrieved references are semantically complete sentences with good fluency, which might mitigate grammatical errors during the generation process.

Generated outputs -
T5: The sun is hot.
UNICORN: The sun does not make clothes wet.
RACO: The sun would dry a t-shirt but not make a t-shirt wet.

CommonGen Input Words: eye, look, move
Retrieved evidence -
#1 She moves her eyes around.
#2 The eye looks towards the peaks.
#3 A woman looks at the camera as she moves each eye individually.
#4 His eyes move across the paper.
Generated outputs -
T5: Someone looks at someone, then moves his eyes.
UNICORN: Someone looks at her and moves her eyes.
RACO: A man moves his eyes to look at the camera.

Table 10: Case study. RACo corrects the erroneous predictions of baseline models (e.g., T5 and UNICORN) using the retrieved commonsense knowledge.

Case Study
Table 10 shows two examples with predictions from different models. We show a "comparison" statement from CSQA2.0 as the first example. As shown in the table, both T5 and UNICORN make wrong predictions, demonstrating a lack of commonsense knowledge. By leveraging the retrieved evidence from the commonsense corpus, our proposed RACO can tell that the statement "a private college is usually smaller than a public university in attendance" is true. In addition, we show an example from the counterfactual explanation generation task as the second example. Among the three outputs shown, the explanation generated by T5 is less associated with the input statement. Compared with the output generated by UNICORN, our model generates a semantically richer and more reasonable explanation. This is mainly because the retrieved references provide strong evidence from the perspective that the sun dries things out.

Epilogue
Conclusions. Retrieval-augmented methods have been widely used in many NLP tasks such as open-domain question answering. However, applying this approach to commonsense reasoning has been neglected in the existing literature. In this paper, we systematically investigated how to leverage commonsense knowledge retrieval to improve commonsense reasoning tasks. We collected a commonsense corpus containing over 20 million documents, and proposed novel strategies for training a commonsense retriever. Extensive experiments demonstrate that our method can effectively improve performance on various commonsense reasoning tasks, achieving new state-of-the-art performance on the CommonGen and CREAK leaderboards.
Future work. A natural extension of this work is to leverage heterogeneous knowledge to improve commonsense reasoning tasks, such as combining structured (i.e., knowledge graph) and unstructured (i.e., retrieved text) knowledge. Such a model will require building a graph reasoning module and a textual reasoning module, and merging the knowledge learned from both, which is a challenging task. A second future direction is to learn a commonsense dense retriever without question-document pairs. For example, in binary question answering, the labels are True / False, which cannot be used to train a commonsense retriever.
Limitations. There are two main limitations. First, RACO retrieves documents from a large-scale corpus, then leverages the retrieved documents to make predictions. So, compared with baseline methods such as T5 and UNICORN, RACO consumes more time and computing resources. Second, due to the diversity and multi-source nature of commonsense knowledge, the retrieved evidence might contain noisy information that can even hurt model performance. A fine-grained filtering or re-ranking module could be future work.

A.2 Implementation Details
Retriever. We employed two independent pre-trained BERT-base models with 110M parameters (Devlin et al., 2019) as the query and document encoders. BERT-base consists of 12 Transformer layers. For each layer, the hidden size is set to 768 and the number of attention heads is set to 12. All dense retrievers were trained for 40 epochs with a learning rate of 1e-5. We used Adam (Kingma and Ba, 2015) as the optimizer, and set its hyperparameter ϵ to 1e-8 and (β1, β2) to (0.9, 0.999). The batch size was set to 32 on 8x32GB Tesla V100 GPUs.
Reader. We employed the FiD (Izacard and Grave, 2021) model built on T5-large (Raffel et al., 2020). For model training, we used AdamW (Loshchilov and Hutter, 2019) with batch size 32 on 8x32GB Tesla V100 or A100 GPUs. We experimented with learning rates of 1e-5/3e-5/6e-5/1e-4 and found that in general the model performed best at 3e-5. All reader models were trained for 20,000 steps in total, with the learning rate warmed up over the first 2,000 steps and then linearly decayed.
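The reader's learning-rate schedule (linear warmup over the first 2,000 steps, then linear decay to zero over 20,000 total steps, with the 3e-5 peak reported above) can be sketched as:

```python
def learning_rate(step, peak_lr=3e-5, warmup_steps=2000, total_steps=20000):
    """Linear warmup followed by linear decay, matching the reader
    training schedule described in the text."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up to peak_lr
    # decay linearly from peak_lr at warmup_steps to 0 at total_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

The exact shape of the decay (linear here) mirrors the description above; other schedules (cosine, inverse square root) are common drop-in alternatives.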

A.3 Additional Related Work
Pre-training a language model on commonsense corpora is the most straightforward way to learn commonsense knowledge. Meanwhile, it also helps avoid overfitting when fine-tuning on downstream tasks. When the commonsense materials are appropriately selected, this simple strategy can yield performance significantly superior to vanilla pre-trained language models (Zhou et al., 2021). Notable methods include COMET (Bosselut et al., 2019), CALM (Zhou et al., 2021), Unicorn (Lourie et al., 2021), etc. For example, COMET initialized its parameters from GPT-2 and post-trained on ATOMIC to adapt its learned representations to knowledge generation, producing novel, high-quality knowledge tuples (Bosselut et al., 2019). Unicorn initialized its parameters from T5 and performed multi-task training on six commonsense question answering datasets (Lourie et al., 2021). While this development is exhilarating, such commonsense-aware language models still suffer from the following drawbacks: first, they are usually trained offline, rendering the model agnostic to the latest information, e.g., Covid-19 is a disease caused by a coronavirus discovered in 2019. Second, they make predictions by only "looking up information" stored in their parameters, leading to inferior interpretability (Shuster et al., 2021).
Incorporating a knowledge graph (KG) is essential for many commonsense reasoning tasks to augment the limited textual information. A KG often provides comprehensive and rich entity features and relations, so models can easily traverse links to discover how entities are interconnected to express certain commonsense knowledge. Some recent work explored using graph neural networks (GNN) to reason over multi-hop relational KG paths, yielding remarkable performance on many commonsense reasoning tasks, such as commonsense question answering (Lin et al., 2019; Yasunaga et al., 2021; Zhang et al., 2022), abductive reasoning (Ji et al., 2020; Yu et al., 2022c), and chit-chat dialogue systems (Zhou et al., 2018; Zhang et al., 2020). The most frequently used KG is ConceptNet. For example, Ji et al. (2020) enriched concept representations in the input text with neighbouring concepts on ConceptNet and performed dynamic multi-hop reasoning over multi-relational paths so the knowledge can be embedded into the hidden representations. Nevertheless, the type of commonsense knowledge is restricted by the relations defined in a knowledge graph schema. Meanwhile, commonsense knowledge spans an unbounded set of facts and situations that usually cannot be covered by a single knowledge graph. Reasoning over multiple knowledge graphs is a challenging task.

On the CSQA2.0 dataset, compared to T5, our model improves accuracy by 8.3% on all dev data (shown in the first column). However, model performance differs across statement types. For example, from the predicted results of T5, the performance on "comparison" statements and "condition" statements is below average. By introducing the retrieved commonsense knowledge, RACO demonstrated significantly better performance on these two sub-categories, achieving 15.3% and 18.5% improvements, significantly higher than the average 8.3% improvement. Nevertheless, we also observe that the retrieved evidence might provide noisy information, resulting
in performance degradation, such as "reason" related statements.We show an example in Table 10.
Statements under these categories are often descriptions or comparisons of factual commonsense, the retrieved documents can thus well complement the necessary knowledge of a given statement.However, some statements require the model to reason in a given scenario, so making correct predictions requires the model to use commonsense knowledge to understand the local contexts.In these statements, retrieved knowledge might even contradict the assumptions, hurting the model performance.
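The per-category breakdown above can be computed with a short evaluation helper. Below is a minimal sketch, assuming each dev example carries a hypothetical "category" annotation and a gold "label" (these field names are illustrative, not from the released data):

```python
from collections import defaultdict

def per_category_accuracy(examples, predictions):
    """Accuracy broken down by statement type ("comparison",
    "condition", "reason", ...). `examples` is a list of dicts with
    assumed keys "category" and "label"; `predictions` is a parallel
    list of model outputs."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex["category"]] += 1
        correct[ex["category"]] += int(pred == ex["label"])
    return {cat: correct[cat] / total[cat] for cat in total}

def per_category_improvement(examples, baseline_preds, raco_preds):
    """Accuracy delta of the retrieval-augmented model over the
    baseline for each category; negative values flag categories
    (e.g. "reason" statements) where retrieved evidence adds noise."""
    base = per_category_accuracy(examples, baseline_preds)
    raco = per_category_accuracy(examples, raco_preds)
    return {cat: raco[cat] - base[cat] for cat in base}
```

This is exactly the computation behind Figure 3: one accuracy per statement type for each model, then a per-type difference.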

Figure 2 :
As the number of retrieved documents increases, the performance of RACo on the CommonGen dataset first increases and then remains unchanged on BLEU-4, or even decreases on SPICE (the primary metric on the CommonGen leaderboard).
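The trend in Figure 2 follows from how the reader consumes evidence: the top-k retrieved documents are appended to the input, so a larger k adds both useful knowledge and noise. A minimal sketch of this input construction, with a hypothetical word-level budget standing in for the reader's actual token limit:

```python
def build_reader_input(question, retrieved_docs, k, max_words=512):
    """Concatenate the question with its top-k retrieved documents,
    truncated to a fixed budget. Beyond some k, extra documents are
    either truncated away or contribute mostly noise, which is one
    plausible reading of the eventual SPICE drop in Figure 2."""
    parts = [f"question: {question}"]
    for rank, doc in enumerate(retrieved_docs[:k], start=1):
        parts.append(f"evidence {rank}: {doc}")
    words = " ".join(parts).split()
    return " ".join(words[:max_words])
```

The `question:` / `evidence i:` markers here are illustrative; the actual input template used by RACo's reader may differ.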
1. CSQA2.0
Statement - A private college is usually smaller than a public university in attendance. (True)
Retrieved evidence - #1 Private schools are usually small and are worth the cost. #2 University's are larger then most colleges. #3 Colleges considered "small" have fewer than 5,000 students
Predictions - T5 and UNICORN: False; RACo: True
2. ComVE
Statement - The sun made my t-shirt wet.
Retrieved evidence - #1 The sun can dry wet clothes. #2 The sun can dry something that is wet.

Figure 3 :
Accuracy of T5 and our RACo for different statement types on the CSQA2.0 dataset.

Table 1 :
Comparison of RACo to a few recent commonsense retrieval works in the field. Our work provides a more comprehensive and larger-scale multi-source commonsense corpus that can generalize to various tasks.

Table 3 :
Compared with commonsense-aware language model (CLM) and knowledge graph reasoning (KGR) counterparts, our retrieval-augmented commonsense reasoning (RACo) outperforms the baseline methods and achieves state-of-the-art performance on the CommonGen and ComVE benchmarks. *: primary metric.

Table 4 :
RACo achieves better performance than other knowledge-enhanced methods.

Table 5 :
RACo outperforms the baseline methods and achieves state-of-the-art performance on the CREAK benchmark.

Table 6 :
Retrieval performance on the dev sets of Mihaylov et al. (2018) and Aggarwal et al. (2021), measured as the percentage of questions whose retrieved documents contain the ground-truth document annotated in Mihaylov et al. (2018) and Aggarwal et al. (2021). DPR Wiki directly uses the DPR trained on Wikipedia for commonsense retrieval without any fine-tuning; DPR RACo trains the commonsense dense retriever using our proposed training-pair construction strategy.
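The metric reported in Table 6 can be expressed as a small function. A sketch, assuming one annotated ground-truth document per question:

```python
def retrieval_hit_rate(retrieved_lists, gold_docs, k):
    """Percentage of questions whose top-k retrieved documents
    contain the annotated ground-truth document. `retrieved_lists`
    holds one ranked document list per question; `gold_docs` holds
    the matching ground-truth document for each question."""
    hits = sum(gold in docs[:k] for docs, gold in zip(retrieved_lists, gold_docs))
    return 100.0 * hits / len(gold_docs)
```

In practice the comparison is typically done on document identifiers rather than raw text, but the computation is the same.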

Table 7 :
Model performance (on dev sets) when using commonsense retrievers trained on different datasets. Training with question-document pairs from all datasets yields the best average performance on six tasks. *: primary metric.
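The retrievers compared here are trained contrastively on question-document pairs. Below is a minimal NumPy sketch of the standard DPR-style objective with in-batch negatives, where every other document in the batch serves as a negative for a given question; this illustrates the general technique, and the exact RACo training recipe may differ:

```python
import numpy as np

def in_batch_negative_loss(q_vecs, d_vecs):
    """Contrastive loss over a batch of question/document embeddings.
    The positive for question i is document i; the remaining B-1
    documents in the batch act as negatives."""
    scores = q_vecs @ d_vecs.T                          # (B, B) dot-product similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()                  # NLL of the positive documents
```

Minimizing this loss pulls each question embedding toward its paired document and away from the other documents in the batch, which is what makes the choice of training pairs (the subject of Table 7) so consequential.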

Table 8 :
Performance on dev sets when retrieving commonsense knowledge from corpora of different sizes.

Table 9 :
Human evaluations by independent scoring based on accuracy and fluency. * indicates p-value < 0.05 under a paired t-test between RACo and baselines.