BUCA: A Binary Classification Approach to Unsupervised Commonsense Question Answering

Unsupervised commonsense reasoning (UCR) is becoming increasingly popular as the construction of commonsense reasoning datasets is expensive, and they are inevitably limited in their scope. A popular approach to UCR is to fine-tune language models with external knowledge (e.g., knowledge graphs), but this usually requires a large number of training examples. In this paper, we propose to transform the downstream multiple choice question answering task into a simpler binary classification task by ranking all candidate answers according to their reasonableness. To this end, for training the model, we convert the knowledge graph triples into reasonable and unreasonable texts. Extensive experimental results show the effectiveness of our approach on various multiple choice question answering benchmarks. Furthermore, compared with existing UCR approaches using KGs, ours is less data hungry.


Introduction
Commonsense reasoning has recently received significant attention in NLP research (Bhargava and Ng, 2022), with a large number of datasets now available (Levesque, 2011; Gordon et al., 2012; Sap et al., 2019; Rashkin et al., 2018; Bisk et al., 2020; Talmor et al., 2019). Most existing methods for commonsense reasoning either fine-tune large language models (LMs) on these datasets (Lourie et al., 2021) or use knowledge graphs (KGs) (Pan et al., 2017) to train LMs (Liu et al., 2019a; Yasunaga et al., 2022). However, relevant training data is not always available, so it is crucial to develop unsupervised approaches to commonsense reasoning that do not rely on labeled data.
In this paper, we focus on the unsupervised multiple choice question answering (QA) task: given a question and a set of answer options, the model is expected to predict the most likely option. We propose BUCA, a binary classification framework for unsupervised commonsense QA. Our method roughly works as follows: we first convert knowledge graph triples into textual form using manually written templates, and generate positive and negative question-answer pairs. We then fine-tune a pretrained language model, and leverage contrastive learning to increase its ability to distinguish reasonable pairs from unreasonable ones. Finally, we input each question and all options of the downstream commonsense QA task into BUCA to obtain the reasonableness scores and select the answer with the highest reasonableness score as the predicted answer. Experimental results on various commonsense reasoning benchmarks show the effectiveness of our proposed BUCA framework. Our main contributions are:
• We propose a binary classification approach to using KGs for unsupervised commonsense question answering.
• We conduct extensive experiments, showing the effectiveness of our approach while using much less data.

Related work
Language models are widely used in unsupervised commonsense inference tasks, e.g., as an additional knowledge source or as a scoring model. Rajani et al. (2019) propose an explanation generation model for the CommonsenseQA dataset. Self-talk (Shwartz et al., 2020) uses prompts to stimulate GPT and generate new knowledge. SEQA (Niu et al., 2021) generates several candidate answers using GPT2 and then ranks each of them. Another research direction in unsupervised commonsense reasoning is the use of external resources, e.g., commonsense KGs (Speer et al., 2016; Romero et al., 2019; Malaviya et al., 2020), to train the model (Chen et al., 2021; Geng et al., 2023). In Banerjee and Baral (2020), given the inputs of context, question and answer, the model learns to generate one of the inputs given the other two. Ma et al. (2021) update the model with a margin ranking loss computed on positive and negative examples from KGs. MICO (Su et al., 2022) uses the distance between the positive and negative question-answer pairs obtained from the KG to calculate the loss. However, all of the above approaches demand a large amount of training data, sometimes reaching millions of training samples, while BUCA only needs tens of thousands, cf. Table 2. The most similar to our work is NLI-KB (Huang et al., 2021), which trains a model on NLI data, then applies the corresponding knowledge to each question-answer pair on the downstream task. Our paper, instead, shows that it is not the NLI data but the retrieved knowledge that helps.

Methodology
We focus on the following multiple choice question answering (QA) task: given a question q and a set of options A, the model should select the most likely single answer A_i ∈ A. We consider an unsupervised setting in which the model does not have access to the training or validation data of the downstream tasks. Our BUCA approach first trains the model with a knowledge graph and then uses the trained model to test on multiple QA downstream tasks. Formally, a knowledge graph (KG) (Pan et al., 2017) G is a tuple (V, R, T), where V is a set of entities, R is a set of relation types and T is a set of triples of the form (h, r, t), with h, t ∈ V the head and tail entities and r ∈ R the relation of the triple connecting h and t.
Our approach has three main components: converting the knowledge graph into training data, designing the training loss, and testing on the downstream tasks.

Converting Triples into Binary Classification Training Data. Inspired by previous work (Su et al., 2022), each KG triple is converted into question-answer pairs using pre-defined templates, and the obtained pairs are then used as the input of the classification task. We use the templates provided by Hwang et al. (2020). For example, the ATOMIC triple (PersonX thanks PersonY afterwards, isAfter, PersonX asked PersonY for help on her homework) can be converted to "After PersonX asked PersonY for help on her homework, PersonX thanks PersonY afterwards". In the appendix we show the distribution of the converted sequence pairs. Along with the correct QA pairs created from the KG triples, our framework is also trained on negative QA pairs, so it can better discriminate between reasonable and unreasonable QA pairs. More precisely, in the training dataset, each correct QA pair generated from a triple tp = (h, r, t) has a corresponding negative pair obtained from a variation of tp in which t is substituted by t′, which is randomly drawn from the existing tails in the KG.
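The conversion step can be illustrated with a minimal sketch. The template dictionary, the helper names (`triple_to_text`, `make_training_pair`) and the toy tail pool below are assumptions made for illustration; the actual templates are those of Hwang et al. (2020) and are not reproduced here. Excluding the gold tail when sampling the negative is likewise an assumption.

```python
import random

# Hypothetical template for illustration only; the paper uses the templates
# of Hwang et al. (2020), which are not reproduced here.
TEMPLATES = {
    "isAfter": "After {tail}, {head}",
}


def triple_to_text(head, relation, tail):
    """Render a KG triple (h, r, t) as a sentence with a fixed template."""
    return TEMPLATES[relation].format(head=head, tail=tail)


def make_training_pair(triple, all_tails, rng):
    """Return one reasonable and one unreasonable text for a single triple.

    The negative keeps h and r and swaps t for a tail t' drawn at random
    from the tails in the KG (excluding the gold tail: an assumption).
    """
    head, relation, tail = triple
    candidates = [t for t in all_tails if t != tail]
    negative_tail = rng.choice(candidates)
    return {
        "positive": triple_to_text(head, relation, tail),
        "negative": triple_to_text(head, relation, negative_tail),
    }


triple = (
    "PersonX thanks PersonY afterwards",
    "isAfter",
    "PersonX asked PersonY for help on her homework",
)
# Toy tail pool (made up); in practice this would be every tail in the KG.
tails = [triple[2], "PersonX goes to the beach", "PersonX buys a new car"]
pair = make_training_pair(triple, tails, random.Random(0))
print(pair["positive"])
print(pair["negative"])
```

Because the negative tail is drawn at random, the resulting negative text is often unrelated to the head, a property the accuracy analysis and the Limitations section return to.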
Training Loss. For our binary classification model, we add a classification head with two nodes to the pre-trained language model. After normalizing the values on these two nodes, we obtain reasonable and unreasonable scores for the QA pairs. From the triple conversion step, we obtain n training examples, each consisting of a question q, a correct answer a_c, and an incorrect answer a_w. For each question-answer pair, we then obtain the reasonable and unreasonable scores r_i^+ and r_i^- after applying a softmax layer. In each loss calculation, we jointly consider the correct and incorrect answers. For binary classification, we use two kinds of losses:
Traditional Binary Loss (TBL), where p^+_{a_c} and p^-_{a_w} are the probabilities of the correct and incorrect answers, corresponding to the reasonable and unreasonable scores respectively.
Margin Ranking Loss (MRL), where η is a margin threshold hyper-parameter.
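The loss equations themselves are not present in this copy of the text. As a rough sketch of what the two losses might look like for a single (q, a_c, a_w) example, assume that TBL is the usual negative log-likelihood of the two target classes and that the margin ranking loss is a hinge on log-probability gaps with margin η. Both forms, the function names and the default `eta` are assumptions, not the authors' stated formulas.

```python
import math


def softmax2(logit_reasonable, logit_unreasonable):
    """Normalize the two classification-head outputs into (p_plus, p_minus)."""
    m = max(logit_reasonable, logit_unreasonable)
    e_plus = math.exp(logit_reasonable - m)
    e_minus = math.exp(logit_unreasonable - m)
    z = e_plus + e_minus
    return e_plus / z, e_minus / z


def traditional_binary_loss(p_correct, p_wrong):
    """Assumed TBL for one example: -(log p+_{a_c} + log p-_{a_w})."""
    p_plus_c, _ = p_correct
    _, p_minus_w = p_wrong
    return -(math.log(p_plus_c) + math.log(p_minus_w))


def margin_ranking_loss(p_correct, p_wrong, eta=0.5):
    """Assumed MRL for one example: hinge on log-probability gaps, margin eta."""
    p_plus_c, p_minus_c = p_correct
    p_plus_w, p_minus_w = p_wrong
    # The correct answer should look more reasonable than the incorrect one...
    reasonable_term = max(0.0, eta - math.log(p_plus_c) + math.log(p_plus_w))
    # ...and the incorrect answer more unreasonable than the correct one.
    unreasonable_term = max(0.0, eta - math.log(p_minus_w) + math.log(p_minus_c))
    return reasonable_term + unreasonable_term


p_c = softmax2(2.0, -1.0)  # correct answer: mostly "reasonable"
p_w = softmax2(-1.0, 2.0)  # incorrect answer: mostly "unreasonable"
print(traditional_binary_loss(p_c, p_w))
print(margin_ranking_loss(p_c, p_w))
```

Under these assumptions the hinge terms vanish once the two answers are separated by more than η, whereas the binary loss keeps pushing both probabilities towards 1.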
To pull the representations of reasonable question-answer pairs as close together as possible and to push the representations of reasonable and unreasonable pairs as far apart as possible, we use supervised contrastive learning (Gunel et al., 2021) along with the binary classification. For a given example, all other examples in the same category are treated as its positive examples.
In the contrastive loss of the i-th QA pair, τ is the temperature parameter and h denotes the feature vector.
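The contrastive loss equation is not present in this copy. The sketch below follows the general shape of a supervised contrastive loss: for each anchor i, same-label examples are positives and all other examples in the batch form the denominator. Using dot-product similarity on plain feature lists and summing over anchors are assumptions made for illustration.

```python
import math


def supervised_contrastive_loss(features, labels, tau=0.1):
    """Supervised contrastive loss over a batch of feature vectors h.

    features: list of equal-length float lists (assumed already normalized)
    labels:   1 for reasonable pairs, 0 for unreasonable pairs
    tau:      temperature
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(features)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-category partner in this batch
        sims = {k: dot(features[i], features[k]) / tau for k in range(n) if k != i}
        # Log-sum-exp over all non-anchor examples, computed stably.
        m = max(sims.values())
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total -= sum(sims[j] - log_denominator for j in positives) / len(positives)
    return total


batch = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = supervised_contrastive_loss(batch, [1, 1, 0, 0])
misaligned = supervised_contrastive_loss(batch, [1, 0, 1, 0])
print(aligned, misaligned)
```

In the toy batch, the loss is small when same-category examples share a direction and large when the categories are mixed across directions, which is the behavior the paragraph above describes.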
Inference. In the prediction phase, we calculate the reasonableness score of each candidate answer and choose the answer with the highest reasonableness score as the predicted answer.
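The selection rule amounts to an argmax over independently scored options. In the sketch below, `predict_answer` and `toy_score` are hypothetical names; the toy scorer simply returns the scores shown in Figure 1, whereas a real scorer would run the fine-tuned LM on each question-option pair and read off the "reasonable" probability.

```python
def predict_answer(question, options, score_fn):
    """Score every option separately and return the index of the best one."""
    scores = [score_fn(question, option) for option in options]
    best_index = max(range(len(options)), key=lambda i: scores[i])
    return best_index, scores


# Scores taken from the Figure 1 example; stand-in for a real model call.
FIGURE_1_SCORES = {
    "do the dance on their own": 0.5,
    "learn to dance": 0.8,
    "dance well": 0.6,
}


def toy_score(question, option):
    return FIGURE_1_SCORES[option]


question = (
    "Since they wanted to learn the intricate new dance, "
    "Carson watched Remy's movements. Why did Carson do this?"
)
options = list(FIGURE_1_SCORES)
best, scores = predict_answer(question, options, toy_score)
print(options[best], scores)
```

Since each option is scored on its own, the same model can be applied to benchmarks with different numbers of answer choices.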

Experiments
In this section, we first describe our experiments on five commonsense question answering datasets, followed by ablation studies and data analysis.

Datasets and Baselines
We use two well-known commonsense KGs for training our framework: ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2018).

Main results
Table 1 shows the results for the five benchmarks. On other datasets our framework shows similar behavior with both KGs. As for the loss functions, the margin ranking loss is on average 0.8% higher than the binary loss on ConceptNet, and 0.1% higher on ATOMIC. These results are explained by the fact that the ranking loss separates the scores of reasonable and unreasonable answers more widely. In light of this, we only consider the margin ranking loss in the analysis below.

Ablation Studies
In this section, we analyze the effect of the backbone models and of contrastive learning, and explore the vocabulary overlap between the knowledge training set and the downstream task as well as the accuracy of our BUCA method.
Backbone Pre-trained LMs. Our experiments using different backbone models show that, in general, the stronger the PLM, the better the performance on the downstream task. Regarding the KGs, in the BERT-base and RoBERTa-base variants, the ATOMIC-trained models perform better than the ConceptNet-trained models, while in the RoBERTa-large one they perform similarly. This might be explained by the fact that, as the model capacity increases, it has more inherently available event-like commonsense knowledge, which is necessary in the ATOMIC-based datasets. Detailed results are shown in Table 3.

Effects of Contrastive Learning
Our experiments show that the RoBERTa-large variant with contrastive learning outperforms the version without it on all datasets, regardless of the KG used. Detailed results are shown in Table 4.
Accuracy of the Binary Classifier. Inspired by Ghosal et al. (2022), we evaluate how often input sequences corresponding to correct and incorrect answers are accurately predicted. To this end, we use the RoBERTa-large variant trained on ATOMIC. Table 5 shows that our model tends to predict all answers as reasonable: since the negative examples in our training set are randomly selected, many negative QA pairs are semantically irrelevant or even ungrammatical. In contrast, many of the manually crafted candidate answers are semantically relevant and grammatical, so our model predicts them as reasonable. We also see that the accuracy metrics for SCT and COPA are the highest. Our findings are consistent with Ghosal et al. (2022).
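The per-instance statistics reported in Table 5 can be computed as sketched below, following the column definitions given in the table caption. Treating an answer as "predicted positive" when its reasonable score exceeds 0.5 is an assumption, as are the function name and the toy scores.

```python
def classifier_breakdown(instances, threshold=0.5):
    """Compute the Table 5 columns as percentages of instances.

    instances: list of (scores, correct_index), where scores holds the
               reasonable score of every answer choice of one question.
    threshold: assumed cut-off above which an answer counts as positive.
    """
    names = ["Neg", "Pos", "Incor as Neg", "Cor as Pos", "Accurate"]
    counts = dict.fromkeys(names, 0)
    for scores, correct_index in instances:
        predicted_positive = [s > threshold for s in scores]
        incorrect_flags = [
            p for i, p in enumerate(predicted_positive) if i != correct_index
        ]
        incor_as_neg = not any(incorrect_flags)
        cor_as_pos = predicted_positive[correct_index]
        counts["Neg"] += not any(predicted_positive)
        counts["Pos"] += all(predicted_positive)
        counts["Incor as Neg"] += incor_as_neg
        counts["Cor as Pos"] += cor_as_pos
        # Accurate is the intersection of the two previous conditions.
        counts["Accurate"] += incor_as_neg and cor_as_pos
    return {name: 100.0 * count / len(instances) for name, count in counts.items()}


toy_instances = [
    ([0.9, 0.8, 0.7], 0),  # everything predicted as reasonable
    ([0.9, 0.2, 0.1], 0),  # fully accurate
    ([0.1, 0.2, 0.3], 2),  # everything predicted as unreasonable
    ([0.2, 0.9], 0),       # correct rejected, incorrect accepted
]
breakdown = classifier_breakdown(toy_instances)
print(breakdown)
```

A high "Pos" value together with a low "Accurate" value corresponds to the tendency described above, where most answer choices are judged reasonable.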

Data Analysis
To better understand why transfer learning from CKGs is more suitable than from other datasets (i.e., MNLI or QNLI) in the commonsense QA task, we performed an analysis on the training data in NLI-KB (Huang et al., 2021) and the CKGs used. Following Mishra et al. (2021), we first compare the vocabulary overlap of ConceptNet, ATOMIC and MNLI (training data) with our evaluation QA datasets, using their definition of overlap. Table 6 shows that MNLI has higher vocabulary overlap with all the evaluation datasets than both CKGs. However, the results for NLI-KB in Table 1 show that vocabulary overlap is not a key factor for performance, as otherwise NLI-KB fine-tuned with the NLI datasets (before injecting knowledge) should perform better than the other models in the downstream task due to the high lexical similarity. We also analyze the distance between sentence embeddings. Our results show that the MNLI entries perform poorly in commonsense knowledge retrieval for SIQA queries, as they are not reasonable answers. In contrast, the sentences generated from ATOMIC and ConceptNet successfully pair the SIQA questions with reasonable answers. This reveals that, although MNLI has a higher lexical coverage, it does not have suitable examples to match SIQA questions. Thus, models fine-tuned with the NLI dataset hardly get any benefit for downstream commonsense reasoning tasks. Tables 7 and 8 present a random sample showing this, where reasonable alternatives are in bold.

CSQA Example
Question: If you have leftover cake, where would you put it?
Answer: refrigerator

MNLI
In the waste-paper basket. This entails in the garbage bin.
In the middle of the dinner plate (or is it a base drum?) This entails in the center of the dinner plate.
We always keep it in the hall drawer. This entails it's always kept in the drawer in the hall.

ATOMIC
John cuts the cake. as a result, John wants put the rest of the cake in fridge
John places in the oven. but before, John needed to mix the cake ingredients
John puts in the fridge. but before, John needed to grab it off the table

ConceptNet
oven is the position of cake
refrigerator is the position of moldy leftover
fridge is the position of leftover

Conclusion
We presented a framework that converts KGs into positive and negative question-answer pairs to train a binary classification model that discriminates whether a sentence is reasonable. Extensive experiments show the effectiveness of our approach, while using a reasonably small amount of data. For future work, we will explore how to better select negative cases.

Limitations
The method to select negative examples could be improved, as randomly selecting negative examples for training might lead to identifying most of the examples in the evaluation datasets as reasonable. Secondly, we did not explore other numbers of candidates in the training set: we always use two candidate answers for each question.
Choice of Plausible Alternatives (COPA) (Gordon et al., 2012) is a two-choice question-answer dataset designed to evaluate performance in open-domain commonsense causal reasoning. Each entry contains a premise and two possible answers; the task is to select the answer that most likely has a causal relationship with the premise. The dataset consists of 500 questions for both development and test sets.

B Ablation Studies
We present the full results for the ablation studies discussed in Section 4.3: Table 3 for the backbone models study, Table 4 for the influence of contrastive learning, and Table 5 for the accuracy of the binary classifier.

C Data Analysis
In the analysis of the distance to sentence embeddings, we treat each entry in the CKG datasets as a possible answer and encode it using the SBERT pre-trained model (all-mpnet-base-v2) (Reimers and Gurevych, 2019, 2020). Then, the cosine similarity between the SIQA question and the encoded sentences is calculated to rank their semantic relatedness. We retrieve the top 3 answers for each source and list them by similarity score in descending order. Table 10 extends the results presented in Section 4.4; Table 11 shows the alternative answers from the CKG datasets for COPA questions.
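The ranking step can be sketched with plain cosine similarity over pre-computed embeddings. The three-dimensional vectors below are made-up stand-ins for SBERT embeddings, and `top_k` is a hypothetical helper; only the ranking logic is illustrated, not the encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vector, candidates, k=3):
    """Return the k (sentence, similarity) pairs most similar to the query.

    candidates: list of (sentence, embedding) pairs.
    """
    scored = [
        (sentence, cosine_similarity(query_vector, vector))
        for sentence, vector in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up); real ones would come from the sentence encoder.
query = [1.0, 0.2, 0.0]
candidates = [
    ("fridge is the position of leftover", [0.9, 0.3, 0.1]),
    ("oven is the position of cake", [0.5, 0.5, 0.5]),
    ("In the waste-paper basket.", [0.0, 0.1, 1.0]),
    ("refrigerator is the position of moldy leftover", [0.8, 0.1, 0.3]),
]
for sentence, score in top_k(query, candidates):
    print(f"{score:.3f}  {sentence}")
```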

SIQA Example
Question: After a long grueling semester, Tracy took the final exam and finished their course today. Now they would graduate. Why did Tracy do this?
Answer: complete their degree on time

MNLI
Because I had a deadline. This entails I had to finish by that time.
The professors went home feeling that history had been made. This entails The professors returned home.
They got married after his first year of law school. This entails Their marriage took place after he finished his first year of law school.

COPA Example
Question: The boy wanted to be muscular. As a result,
Answer: He lifted weights.

MNLI
Emboldened, the small boy proceeded. This entails the small boy felt bolder and continued.
Out of shape, fat boy. This entails the boy was obese.
When Sport Resort won the contract for the construction of a new hotel center for 1200 people around the Olympic Sports Arena (built as a reserve for the future, to have it ready in time for the next championships), Gonzo began to push his weight around, because he felt more secure. This entails when Sport Resort won the contract for the construction of a new hotel Gonzo felt more secure.

ATOMIC
John wanted to build his physique. as a result the boy lifts weights
The boy starts working out. as a result, the boy wants to gain more muscle
The boy starts lifting weights. as a result, the boy will build muscle

[Figure 1 content: a training question "John watches other dance, Why?" and a SocialIQA question "Since they wanted to learn the intricate new dance, Carson watched Remy's movements. Why did Carson do this?", whose options are scored separately for reasonableness: (A) do the dance on their own (0.5), (B) learn to dance (0.8), (C) dance well (0.6).]
Figure 1: After BUCA is trained on the above question from the training set, it is then able to rate the reasonableness of each sentence of the downstream task.
ATOMIC
Tracy wants finish before time expires. because Tracy takes the exam
Tracy wanted to get a degree. as a result Tracy finishes Tracy's test
Tracy graduates with a degree. but before, Tracy needed get pass with good marks.

ConceptNet
pass class causes graduation
study ends with the event or action graduate
graduation because take final exam

Table 1: Accuracy (%) on five public benchmarks. Our best scores are highlighted in bold, and the results for the best performing baseline are underlined. Recall that TBL and MRL refer to the loss functions used in BUCA.

Table 2: Statistics for the training and validation data used by Ma, MICO and BUCA.

Table 4: The influence of contrastive learning.

OpenBookQA and SCT, but it achieves state-of-the-art results on CSQA (67.4) and on SIQA (63.2), while BUCA's best results are 65.4 and 61.4, respectively. However, Ma uses multiple KGs to train a single model (ConceptNet, WordNet, and Wikidata for CSQA; ATOMIC, ConceptNet, WordNet, and Wikidata for SIQA), with a total of 662,909 and 1,197,742 training examples, while BUCA only uses 65,536 and 61,530, cf. Table 2. Considering the difference in the amount of training data used and the closeness of the results, BUCA's approach demonstrates its effectiveness. We can also observe the same trend as in MICO: ConceptNet is more helpful for CSQA and ATOMIC is more helpful for SIQA. This is explained by the fact that SIQA is built based on ATOMIC and CSQA is built based on ConceptNet.
We also outperform MICO by 5.4% on SIQA, NLI-KB by 13.3% on CSQA, and NLI-KB by 16.3% on SCT. Ma does not provide results for COPA,

Table 5: The Neg and Pos columns indicate the % of instances for which all answer choices are predicted as negative or positive, respectively. The Incor as Neg, Cor as Pos, and Accurate columns indicate the % of instances for which all incorrect answers are predicted as negative, the correct answer is predicted as positive, and all answers are predicted accurately as negative or positive, respectively. Accurate is the intersection of Incor as Neg and Cor as Pos.

Table 8: Alternative answers for CSQA question.

Table 9: QA pairs generated by KG Triples.

Table 10: Complete results of alternative answers retrieved from MNLI, ATOMIC and ConceptNet for SIQA question. Reasonable alternatives are in bold.

Table 11: Alternative answers from CKGs for COPA question.