Improving Unsupervised Commonsense Reasoning Using Knowledge-Enabled Natural Language Inference

Recent methods based on pre-trained language models have shown strong supervised performance on commonsense reasoning. However, they rely on expensive data annotation and time-consuming training. Thus, we focus on unsupervised commonsense reasoning. We show the effectiveness of using a common framework, Natural Language Inference (NLI), to solve diverse commonsense reasoning tasks. By leveraging transfer learning from large NLI datasets, and injecting crucial knowledge from commonsense sources such as ATOMIC 2020 and ConceptNet, our method achieved state-of-the-art unsupervised performance on two commonsense reasoning tasks: WinoWhy and CommonsenseQA. Further analysis demonstrated the benefits of multiple categories of knowledge, though problems involving quantities and antonyms remain challenging.


Introduction
Recently, the task of commonsense reasoning has attracted much attention, as it is believed to be a critical yet challenging component of human-level intelligence (Levesque et al., 2012; Davis, 2017; Wang et al., 2019a). To test models' ability to understand natural language and reason with external commonsense knowledge, efforts have been made towards building many challenging WSC-like (Winograd Schema Challenge) tasks and QA (question answering) tasks. Specifically, Zhang et al. (2020a) crowd-sourced human-provided justifications as reasons for the WSC problems, resulting in a new dataset called WinoWhy. An example of WinoWhy is shown in Table 1: the model is asked to determine whether the given reason for the WSC problem is correct. Meanwhile, CommonsenseQA (Talmor et al., 2019), constructed based on ConceptNet (Speer et al., 2017), is designed as a five-choice QA dataset that requires the model to capture the relation between the question and the correct answer. In this work, we experiment on these two commonsense datasets.

A WinoWhy Example
WSC Question: Joan made sure to thank Susan for all the help she had received. She refers to Joan because
Reason: Joan is doing the thanking so she must have received the help.
Label: Positive
Convert WinoWhy to NLI
Premise: Joan is doing the thanking so she must have received the help.
Hypothesis: Joan made sure to thank Susan for all the help Joan had received.
Label: entailment
Table 1: A WinoWhy example consists of a WSC question and a reason, while the label is "Positive" or "Negative". We use NLI as a common task and convert WinoWhy to NLI form.
Although diverse methods based on pre-trained language models and external knowledge have shown very strong supervised performance on commonsense reasoning, the solution process is usually complex and expensive. Generally, we must gather task-specific training data and then train models to learn the patterns in the data. However, as shown in WinoGrande, acquiring unbiased labels requires a carefully designed crowd-sourcing procedure, which greatly adds to the cost of data collection. Moreover, training on large training sets is usually time-consuming. Therefore, instead of applying specific methods to each corresponding task, a reasonable approach is to convert diverse commonsense reasoning tasks to a common task and use a general unsupervised method to solve it. Furthermore, tasks that lack sufficient annotations can also be solved by such a framework.
We attempt to use Natural Language Inference (NLI) as the common task mentioned above. NLI is the task of determining whether a hypothesis is "entailment" or "not entailment" with respect to a given premise.

Figure 1: Overview of our NLI framework with injected knowledge from the knowledge base. NLI-LM and NLI-Classifier denote the LM with a classification head fine-tuned on NLI. We convert the original example of the source task, e.g., WinoWhy or CommonsenseQA, to NLI form and combine KB sentences as input.
The NLI task is well suited as a common task, as it assembles the skills involved in sentence understanding, from the resolution of syntactic ambiguity to pragmatic reasoning with world knowledge (Wang et al., 2019b). Furthermore, NLI has been actively studied, especially since the emergence of large-scale datasets (Bowman et al., 2015; Williams et al., 2018), and we can directly leverage this progress. Moreover, we explore whether injecting external knowledge from knowledge bases into our framework can enhance the model's performance on commonsense reasoning tasks.
We apply RoBERTa, which has shown powerful performance on NLI tasks, as the backbone network of our NLI framework. We first convert a commonsense reasoning task to an NLI form as the original input. As shown in Table 1, we replace the pronoun "she" in the original WSC sentence with the correct candidate "Joan" and treat the replaced sentence as the hypothesis, while the given reason is the premise. Next, to leverage external knowledge, we use the recently introduced ATOMIC 2020 (Hwang et al., 2020) and ConceptNet as knowledge bases (KBs). Specifically, we extract triples from the KBs by matching semantic similarity between the embeddings of KB sentences and the source-task example, and then combine the triples and the original input for RoBERTa. Our experimental results on WinoWhy and CommonsenseQA suggest that the NLI framework is suitable for commonsense tasks and that external knowledge can provide useful information to help the model make correct predictions. Furthermore, more improvements can be obtained by combining multiple effective categories of knowledge. In addition, models perform worse when facing problems about quantity knowledge and antonym relations.

Method
In this section, we describe the details of 1) using the NLI framework to solve commonsense reasoning tasks, 2) extracting knowledge from KBs, and 3) injecting the external knowledge into the NLI framework. The overview of our framework is shown in Figure 1.

NLI Task: A General Framework
The key to solving commonsense reasoning tasks such as WinoWhy and CommonsenseQA is to determine the relation between question-answer pairs. Following this intuition, we use a general task, NLI, which aims at identifying whether a hypothesis sentence can be entailed by a premise sentence. We first convert the original example of the source task to the NLI form. In this work, we define the source task as predicting whether an answer is entailed given a question. The question then corresponds to the premise and the answer to the hypothesis. Moreover, for source tasks like WinoWhy, we can also convert the question (e.g., the WSC question shown in Table 1) to a statement as the hypothesis and treat the reason as the premise, following the if-then relation. Then, we use pre-trained language models (LM) with a classification head to solve the NLI task. Specifically, given a premise and a hypothesis, we concatenate them as the "NLI sentence": [CLS] Premise [SEP] Hypothesis [SEP]. The LM with the classification head then predicts the entailment relation.
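The conversion step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names are ours, and we assume the target pronoun and the correct candidate are given.

```python
import re

def winowhy_to_nli(wsc_sentence, pronoun, candidate, reason):
    """Build an NLI (premise, hypothesis) pair from a WinoWhy example.

    The hypothesis is the WSC sentence with the target pronoun replaced
    by the correct candidate; the crowd-sourced reason is the premise.
    """
    # \b guards keep us from matching the pronoun inside another word.
    hypothesis = re.sub(rf"\b{pronoun}\b", candidate, wsc_sentence, count=1)
    return reason, hypothesis

def to_nli_sentence(premise, hypothesis):
    # Concatenated form fed to the LM with a classification head.
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

premise, hypothesis = winowhy_to_nli(
    "Joan made sure to thank Susan for all the help she had received.",
    "she", "Joan",
    "Joan is doing the thanking so she must have received the help.")
```

Running this on the Table 1 example reproduces the hypothesis "Joan made sure to thank Susan for all the help Joan had received."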
To mitigate the data scarcity in an unsupervised setting, we consider transferring knowledge from large NLI datasets. Specifically, we fine-tune the LM and classification head on either MNLI (Williams et al., 2018) or QNLI (Wang et al., 2019b). We use RobertaForSequenceClassification from the transformers library (Wolf et al., 2020), which is RoBERTa with a classification head on top. When evaluating our framework on a source task, because MNLI has three labels, "entailment", "neutral", and "contradiction", we treat the last two labels as "not entailment". We denote the LM and classification head fine-tuned on NLI datasets as NLI-LM and NLI-Classifier.
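The label collapsing at evaluation time can be sketched as below. The label order here is an assumption and must match the fine-tuned checkpoint's id-to-label mapping.

```python
import math

MNLI_LABELS = ["entailment", "neutral", "contradiction"]  # assumed id order

def entailment_scores(logits):
    """Collapse three-way MNLI logits into binary entailment scores.

    "neutral" and "contradiction" are merged into "not entailment";
    the probabilities come from a softmax over the three logits.
    """
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = dict(zip(MNLI_LABELS, (e / total for e in exps)))
    return probs["entailment"], probs["neutral"] + probs["contradiction"]
```

A QNLI-fine-tuned head is already binary, so this mapping is only needed for MNLI checkpoints.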

Inject External Knowledge
In the following, we show how to extract knowledge from a KB and inject the matched knowledge into the NLI framework. We first convert the triples in the KB to natural language sentences and extract KB triples by calculating cosine similarity between the embeddings of the KB sentence and the source-task example. Finally, we combine the external KB sentence and the original example to help NLI-LM and NLI-Classifier make the correct prediction.
Convert KB Triple to Natural Language Inspired by ATOMIC (Sap et al., 2019a), which is unique in that the entity in a triple is mostly a short sentence, we convert each KB triple to a natural language sentence and then capture helpful knowledge for the original example by matching semantic similarity between them. For example, (PersonX thanks PersonY afterwards, isAfter, PersonX asked PersonY for help on her homework), a triple in ATOMIC, can be extended to "After PersonX asked PersonY for help on her homework, PersonX thanks PersonY afterwards", and (having_no_food, CausesDesire, go_to_a_store), a triple in ConceptNet, corresponds to "having no food makes someone want to go to a store". In this work, we use ATOMIC 2020 (which we call ATOMIC for short in the following) and ConceptNet as knowledge bases. We define templates for every relation in ATOMIC and ConceptNet, then convert the triples in ATOMIC and ConceptNet to natural language sentences automatically using the templates. We name the resulting natural language sentence a "KB sentence".
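The template step can be sketched as follows. The phrasings below are illustrative assumptions covering three relations, not the paper's full template set.

```python
# Hypothetical templates for a few relations; the actual system defines
# one template per relation in ATOMIC 2020 and ConceptNet.
TEMPLATES = {
    "isAfter":      "After {tail}, {head}",
    "CausesDesire": "{head} makes someone want to {tail}",
    "UsedFor":      "{head} is used for {tail}",
}

def triple_to_kb_sentence(head, relation, tail):
    """Convert a (head, relation, tail) KB triple to a KB sentence."""
    head = head.replace("_", " ")  # ConceptNet entities use underscores
    tail = tail.replace("_", " ")
    return TEMPLATES[relation].format(head=head, tail=tail)
```

Applied to the two triples above, this yields exactly the two example KB sentences in the text.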
Extract Knowledge from KB Inspired by (Reimers and Gurevych, 2019), we use NLI-LM to generate the token embeddings of an NLI sentence or a KB sentence. Then we compute the mean of all token embeddings. As shown in Figure 2, an NLI sentence and a KB sentence are input into NLI-LM, and two mean embeddings are output. We then calculate the cosine similarity between the two embeddings as the semantic similarity of the two input sentences. When input into NLI-LM, an NLI sentence takes the form "[CLS] Premise [SEP] Hypothesis [SEP]", and a KB sentence is wrapped by [CLS] and [SEP] as well. When evaluating our framework on a commonsense dataset, for each example, we extract the KB sentences with the top-K semantic similarity.
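The retrieval step can be sketched as follows, with plain Python lists standing in for the token embeddings that NLI-LM would actually produce (in practice these come from the model's last hidden states).

```python
import math

def mean_pool(token_embeddings):
    """Average token embeddings into a single sentence embedding."""
    n, dim = len(token_embeddings), len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_indices(nli_embedding, kb_embeddings, k):
    """Indices of the k KB sentences most similar to the NLI sentence."""
    sims = [(cosine(nli_embedding, e), i) for i, e in enumerate(kb_embeddings)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]
```

In practice the KB sentence embeddings can be pre-computed once, so only the NLI sentence is embedded per example.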

Inject KB Sentence into NLI Sentence
To combine a KB sentence and an NLI sentence, we inject the KB sentence into the middle of the NLI sentence to form a combined sentence. Thus, the combined sentence takes the form "[CLS] Premise [SEP] KB Sentence [SEP] Hypothesis [SEP]". For an NLI sentence with top-K matched KB sentences, we can generate K combined sentences. All of them are input into NLI-LM, and K CLS-token embeddings are output. Then we compute the mean of all CLS-token embeddings. Finally, the mean embedding is input into the NLI-Classifier, and the entailment relation is output.
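The aggregation over the K combined sentences can be sketched as below; the CLS embeddings stand in for NLI-LM outputs, and `classify` is a hypothetical stand-in for the NLI-Classifier.

```python
def aggregate_cls(cls_embeddings):
    """Mean of the K CLS-token embeddings, one per combined sentence."""
    k, dim = len(cls_embeddings), len(cls_embeddings[0])
    return [sum(e[d] for e in cls_embeddings) / k for d in range(dim)]

def predict_entailment(cls_embeddings, classify):
    # classify: callable mapping the mean embedding to an entailment label.
    return classify(aggregate_cls(cls_embeddings))
```

Averaging before classification lets a single classifier pass pool evidence from all K injected KB sentences.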

Tasks
We evaluate our framework on two commonsense reasoning datasets, WinoWhy and CommonsenseQA. Both require commonsense knowledge beyond textual understanding to perform well. All of our experiments use an unsupervised setting, i.e., our model is not trained on the source task.
WinoWhy (Zhang et al., 2020a) contains 2,865 reasons, which belong to 273 WSC examples. We evaluate models on the full set. A WinoWhy example consists of a WSC question and a reason. We use two strategies to convert an example to an NLI sentence, one of which is shown in Table 1.

CommonsenseQA is a multiple-choice QA dataset that specifically measures commonsense reasoning. This dataset is constructed based on ConceptNet. We evaluate models on the development set with 1,221 questions, since the answers to the test set are not publicly available. A CommonsenseQA example consists of a question and 5 choices. We regard the question and a choice as an NLI sentence with the form "[CLS] Q: question [SEP] A: choice [SEP]" (the additional "Q" and "A" follow the recommendation from the FairSeq repo on how to fine-tune RoBERTa on CommonsenseQA 1). Then the entailment score of every choice is calculated. Finally, the choice with the highest score is selected as the answer to the question. In addition, the combined sentence takes the form "[CLS] Q: question [SEP] KB Sentence [SEP] A: choice [SEP]".
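The per-choice scoring for CommonsenseQA can be sketched as follows, where `score_fn` is a hypothetical stand-in for the NLI-LM plus NLI-Classifier returning an entailment score.

```python
def format_qa_nli(question, choice):
    # NLI-sentence form used for CommonsenseQA.
    return f"[CLS] Q: {question} [SEP] A: {choice} [SEP]"

def answer_question(question, choices, score_fn):
    """Pick the choice whose NLI sentence gets the highest entailment score."""
    scores = [score_fn(format_qa_nli(question, c)) for c in choices]
    return choices[max(range(len(choices)), key=scores.__getitem__)]
```

Because the five choices are scored independently, the decision reduces to an argmax over entailment scores.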

Knowledge Bases
ATOMIC (Sap et al., 2019a) is a knowledge base consisting of 880K triples across 9 relations that cover social commonsense knowledge, e.g., (X gets X's car repaired, xIntent, to maintain the car), including aspects of events such as mental states, personal attributes, and social effects. As a later work, Hwang et al. (2020) extend ATOMIC to ATOMIC 2020 with 1.33M triples. ATOMIC 2020 introduces 23 commonsense relations. Triples are of the form ({Event | Entity}, r, {Entity | Event | Behavior | Persona | Mental state}), where the head and tail are nouns or short sentences and r represents an if-then relation type or physical property (e.g., xIntent and ObjectUse). We define a template for each of the 23 relations in ATOMIC 2020 to automatically convert triples to natural language sentences.
ConceptNet (Speer et al., 2017) is a knowledge base focused mostly on taxonomic and lexical knowledge (e.g., IsA, PartOf) and physical commonsense knowledge (e.g., MadeOf, UsedFor). We extracted 29 relations to form a subset with 485K entity-relation triples. Similar to ATOMIC, we define a template for each of the 29 relations. In this work, we use ATOMIC 2020 and ConceptNet for injecting external knowledge into the NLI framework.

Table 3: Performance comparison on the dev set of CommonsenseQA. The accuracies of Self-Talk and SMLM are reported by (Shwartz et al., 2020) and (Banerjee and Baral, 2020). "BERT-Base Sup." denotes the BERT-Base model trained on the CommonsenseQA training set; its result is the test-set accuracy reported by the official leaderboard.

Baselines
For WinoWhy, we consider pre-trained language model baselines, including RoBERTa fine-tuned on WinoGrande (+WinoGrande). We directly use the results reported by (Zhang et al., 2020a).
As for WinoWhy, we use RoBERTa as a baseline for CommonsenseQA. Specifically, we use RobertaForMaskedLM from the transformers library (Wolf et al., 2020), which can be regarded as a RoBERTa model with a masked language modeling head on top. Given a CommonsenseQA question and one of the five choices, we mask the choice tokens and use the masked LM head to predict them. For example, a CommonsenseQA sentence input to the model takes the form "[CLS] question [SEP] choice [SEP]". The choice is then masked, and the masked LM head is used to compute the cross-entropy loss for it. Finally, the choice with the lowest loss is selected as the answer to the question.
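The baseline's decision rule can be sketched as below, with `masked_lm_loss` a hypothetical stand-in for the cross-entropy that RobertaForMaskedLM would return for the masked choice tokens.

```python
def pick_choice_by_mlm(question, choices, masked_lm_loss):
    """Select the choice whose masked tokens get the lowest masked-LM loss."""
    losses = [masked_lm_loss(f"[CLS] {question} [SEP] {c} [SEP]", c)
              for c in choices]
    return choices[min(range(len(choices)), key=losses.__getitem__)]
```

Note the contrast with the NLI framework: the baseline minimizes a language-modeling loss, while our method maximizes an entailment score.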
In addition, we compare our model with Self-Talk (Shwartz et al., 2020) and SMLM (Banerjee and Baral, 2020). Both propose an unsupervised framework for multiple-choice commonsense tasks and show considerable improvements over large pre-trained language models, so we report their dev-set accuracies on CommonsenseQA as baselines.

Table 2 and Table 3 show the results of applying the NLI framework and external knowledge to WinoWhy and CommonsenseQA. Our framework achieves state-of-the-art (SOTA) unsupervised performance on WinoWhy by a large margin. Specifically, using the same language model RoBERTa, we observed improvements of at least +8.52% (66.70% by Base+MNLI+CN).

As for the results on CommonsenseQA, we first observe that RoBERTa hovers near the Random Guess baseline. This result illustrates that RoBERTa cannot deal with CommonsenseQA at all without training. However, after converting CommonsenseQA to NLI form and injecting KB sentences, RoBERTa behaves much better. RoBERTa-Base + MNLI + CN/ATOMIC gets a result comparable to Self-Talk, while RoBERTa-Base + QNLI + CN/ATOMIC has already exceeded SMLM, the previous SOTA method. Finally, it is interesting to note that RoBERTa-Large + QNLI + ATOMIC is only slightly worse than the BERT-Base model trained on the CommonsenseQA training set.

Now we focus on the results of applying the NLI framework without injected knowledge. For WinoWhy, RoBERTa achieves a considerable boost after being fine-tuned on either QNLI or MNLI. For CommonsenseQA, RoBERTa fine-tuned on QNLI gets +4.62% higher dev-set accuracy than Self-Talk and a result comparable to SMLM. These experiments clearly illustrate the effectiveness of the NLI framework and of transfer learning from NLI datasets.

Main Results
When we inject KB sentences into RoBERTa fine-tuned on QNLI, improvements can be observed in the full-set accuracy on WinoWhy and the dev-set accuracy on CommonsenseQA. This indicates that the knowledge from QNLI and that extracted from the KBs complement each other. On the other hand, external knowledge, whether from ATOMIC or ConceptNet, is not much help to RoBERTa fine-tuned on MNLI and even causes a drag. We hypothesize that there is a high overlap or even contradiction between the knowledge in the KBs and MNLI, which causes the incompatibility between them.

Figure 3: Effect of K on WinoWhy and CommonsenseQA. We inject KB sentences with top-K similarity into NLI-LM. The upper figure shows the results on WinoWhy, the lower on CommonsenseQA. "Base" and "Large" denote RoBERTa-Base/Large + QNLI.
In summary, we think the reasons for the significant improvement are that 1) the NLI framework is better suited for such tasks; 2) RoBERTa picks up necessary knowledge from the NLI datasets; and 3) ATOMIC and ConceptNet provide useful information for the source tasks and help models make the correct prediction.

Ablation Study
Category of Injected KB Sentences In order to study whether different categories of external knowledge have a large impact on the model's performance, we divide ATOMIC into five categories: Physical-Entity, Event-Centered, Mental-State, Persona, and Behavior, basically following the definitions of (Hwang et al., 2020). Physical-Entity deals with inferential knowledge about common entities and objects. Event-Centered provides intuitions about how common events are related to one another. Mental-State addresses the emotional or cognitive states of the participants in a given event. Persona describes a person's attributes as perceived by others given an event. Behavior addresses the socially relevant responses to an event. We inject each category separately into RoBERTa-Large + QNLI (the best NLI-LM in our experiments). The results are shown in Table 4. Similar to ATOMIC, we divide ConceptNet into four categories: Physical-Entity, Event-Centered, Social-Interaction, and Taxonomic-Lexical. The meanings of the first two categories are the same as for ATOMIC. Social-Interaction focuses on socially triggered states and behaviors. Taxonomic-Lexical focuses on taxonomic and lexical knowledge. The results are shown in Table 5. It is not surprising that some categories of knowledge drag down the performance. For example, for ATOMIC, injecting Physical-Entity knowledge obtains the worst results on both WinoWhy and CommonsenseQA.
Next, we investigate whether we can get higher accuracies by combining the effective categories. So we combine the two most effective categories for each task. For ATOMIC, they are Behavior and Event-Centered. For ConceptNet, they are Taxonomic-Lexical and Event-Centered for WinoWhy, and Physical-Entity and Event-Centered for CommonsenseQA. The results show that this strategy makes the performance exceed "Overall". The slight boost in accuracy illustrates that our assumption is correct. We also find, from the ConceptNet results, that combining any two categories does not necessarily work. For example, combining Taxonomic-Lexical and Event-Centered does not yield a higher result than "Overall" on CommonsenseQA, because of the poor performance of Taxonomic-Lexical. This tells us that we need to identify effective knowledge when combining different categories.
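Combining categories amounts to filtering the KB by relation before retrieval. A sketch is given below; the category membership of the sample relations is an assumption for illustration, not the paper's exact assignment.

```python
# Hypothetical relation-to-category map for a few ATOMIC relations.
CATEGORY = {
    "ObjectUse": "Physical-Entity",
    "isAfter":   "Event-Centered",
    "xIntent":   "Mental-State",
    "xNeed":     "Behavior",
}

def filter_by_category(triples, keep):
    """Keep only (head, relation, tail) triples whose relation falls in `keep`."""
    return [t for t in triples if CATEGORY.get(t[1]) in keep]
```

Retrieval then runs over the filtered subset, so only the chosen categories can be injected.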

Amount of Injected KB Sentences
As mentioned before, we extract KB sentences with top-K similarity. Now we investigate the impact of the hyperparameter K, with experimental results shown in Figure 3. The results generally follow the intuition that the more knowledge is injected, the better the performance, until the number of injected sentences reaches a threshold, after which the accuracy begins to decrease. We think the reason is that knowledge with lower semantic similarity introduces noise to the model and has a distracting effect.

Discussion
To discuss performance on different knowledge types, we follow the knowledge types defined in (Zhang et al., 2020a) and divide WinoWhy into five subsets. We evaluate RoBERTa-Base/Large + QNLI on each subset. The results are shown in Table 6. "Property" denotes knowledge about the properties of objects. "Object" represents knowledge about objects themselves. "Eventuality", "Spatial", and "Quantity" correspond to eventualities, spatial positions, and numbers, respectively. Comparing RoBERTa-Large fine-tuned on WinoGrande (the best model reported by Zhang et al., 2020a) and RoBERTa fine-tuned on QNLI, the latter exceeds the former on all knowledge types, suggesting that QNLI contains more of the commonsense knowledge needed by WinoWhy than WinoGrande does. Turning to the comparison between models with and without KB sentences, KB sentences matched to WinoWhy examples provide a performance boost on most knowledge types, suggesting that we successfully inject effective knowledge from the KBs into RoBERTa. Further, whether for RoBERTa + QNLI or RoBERTa + QNLI + KB, the worst performance appears on "Quantity". What is more, the injected knowledge, whether from ATOMIC or ConceptNet, brings the lowest benefit to "Quantity" and even results in the only drop, for RoBERTa-Large + QNLI + CN (-1.00). The reason may be the lack of knowledge about numbers in QNLI, ATOMIC, and ConceptNet; large corpora do often lack quantity knowledge. This suggests constructing and encoding quantity knowledge into LMs in future work.
Similar to WinoWhy, we follow the experiment described in (Ma et al., 2019) and divide CommonsenseQA into five subsets. We classify questions based on the ConceptNet relation between the question concept and the correct answer concept, then select the relations with more than 40 questions as knowledge types. Observing the experimental results shown in Table 7, we can draw the same conclusion as for WinoWhy: injecting knowledge following our method provides useful information to the LM and helps it make the correct decision. However, accuracies on "Antonym" are the lowest among the knowledge types, and the boosts after injecting knowledge are also the lowest. "Antonym" denotes that A and B are opposites in some relevant way, such as black and white. We conjecture that this is because the language model has a weak ability to deal with antonym relations.
In addition, we find that ATOMIC brings more benefits to RoBERTa than ConceptNet does. As described in (Hwang et al., 2020), triples in ConceptNet are limited mostly to taxonomic, lexical, and object-centric physical knowledge, making the commonsense portion of ConceptNet relatively small, while ATOMIC has more knowledge related to social commonsense and its coverage is more extensive and balanced. Our experimental results are consistent with these descriptions.

Related work
Commonsense Reasoning Recent commonsense reasoning datasets (Zhou et al., 2019; Sap et al., 2019b; Bisk et al., 2020; Talmor et al., 2019) have motivated research in several domains of commonsense: abductive, temporal, social, and physical. The SOTA models for most of them have achieved over 80% accuracy, which is close to human performance (e.g., Brown et al., 2020; Khashabi et al., 2020; Raffel et al., 2020). However, their success is due to larger pre-training corpora and many more parameters, which is difficult for most researchers to follow. In addition, other useful methods (Yasunaga et al., 2021; Feng et al., 2020) generally require training on training sets and knowledge graphs; when applying them to different tasks, the same running and tuning process must be repeated several times to find the best fit. Thus, we propose a framework to convert diverse commonsense reasoning tasks to a common task, NLI, and use a general unsupervised method to solve it.

Natural Language Inference Since GLUE adopted NLI as a benchmark task for testing the natural language understanding capability of models, NLI has been well studied, and language models have achieved performance beyond humans on some NLI datasets. Furthermore, by leveraging transfer learning from large NLI datasets, strong performance has been achieved on several tasks, such as story ending prediction (Li et al., 2019), intent detection (Zhang et al., 2020b), and semantic textual similarity (Reimers and Gurevych, 2019). Therefore, we attempt to use NLI as the common task to solve commonsense reasoning.
External Knowledge Most commonsense reasoning tasks require models to synthesize external commonsense knowledge and leverage more sophisticated reasoning mechanisms. The key is to extract effective information from commonsense sources such as ATOMIC, ConceptNet, and Wikipedia. Methods learn commonsense knowledge either by pre-training on KGs (Ye et al., 2019) or by reasoning over knowledge graphs (Feng et al., 2020; Lv et al., 2020; Lin et al., 2019). To cooperate with our NLI framework, we convert the triples in a KB to natural language sentences and extract triples by calculating cosine similarity between the embeddings of the KB sentence and the source-task example.

Conclusion
In this work, we propose a framework to convert diverse commonsense reasoning tasks to a common task, NLI, and use a pre-trained language model, RoBERTa, to solve it. By leveraging transfer learning from large NLI datasets, QNLI and MNLI, and injecting crucial knowledge from knowledge bases such as ATOMIC and ConceptNet, our framework achieved SOTA unsupervised performance on two commonsense reasoning tasks: WinoWhy and CommonsenseQA. Experimental results show that the knowledge from QNLI and that extracted from either ATOMIC or ConceptNet complement each other to enhance the model's performance on commonsense reasoning. More improvements can be obtained by combining multiple effective categories of knowledge. A further experiment shows that ATOMIC brings more benefits to RoBERTa than ConceptNet does. However, injected knowledge is not much help to RoBERTa fine-tuned on MNLI and even causes a drag. In addition, models perform worse when facing problems about quantity knowledge and antonym relations. The code is publicly available at https://github.com/sysuhcm/NLI-KB.