Decker: Double Check with Heterogeneous Knowledge for Commonsense Fact Verification

Commonsense fact verification, as a challenging branch of commonsense question-answering (QA), aims to verify through facts whether a given commonsense claim is correct or not. Answering commonsense questions necessitates a combination of knowledge from various levels. However, existing studies primarily rest on grasping either unstructured evidence or potential reasoning paths from structured knowledge bases, yet fail to exploit the benefits of heterogeneous knowledge simultaneously. In light of this, we propose Decker, a commonsense fact verification model that is capable of bridging heterogeneous knowledge by uncovering latent relationships between structured and unstructured knowledge. Experimental results on two commonsense fact verification benchmark datasets, CSQA2.0 and CREAK, demonstrate the effectiveness of Decker, and further analysis verifies its capability to seize more precious information through reasoning.


Introduction
Commonsense question answering is an essential task in question answering (QA), which requires models to answer questions that entail rich world knowledge and everyday information. The major challenge of commonsense QA is that it not only requires rich background knowledge about how the world works, but also demands the ability to conduct effective reasoning over knowledge of various types and levels (Hudson and Manning, 2018). Recently, a challenging branch of commonsense QA has emerged: commonsense fact verification, which aims to verify through facts whether a given commonsense claim is correct or not (Onoe et al., 2021; Talmor et al., 2022). Different from previous multiple-choice settings which contain candidate answers (Talmor et al., 2019), commonsense fact verification solely derives from the question itself and implements reasoning on top of it (Figure 1). Therefore, it poses a novel issue of how to effectively seize the useful and valuable knowledge to deal with commonsense fact verification.
Figure 1: An example from CSQA2.0 (Talmor et al., 2022). Given the question, we perform a double check between the heterogeneous knowledge (i.e., KG and facts) and aim to derive the answer by seizing the valued information through reasoning.
One of the typical methods is to make direct use of knowledge implicitly encoded in pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; He et al., 2021), which have proved to be usable knowledge bases (Petroni et al., 2019; Bosselut et al., 2019). The knowledge in PLMs is gained during the pre-training stage through mining large-scale collections of unstructured text corpora. Nevertheless, the sore spot is that, while it is natural for human brains to project prior world knowledge onto the answers when facing commonsense questions (Lin et al., 2019; Choi, 2022), it is tough for PLMs to learn commonsense knowledge that is only implicitly stated in plain texts from corpora (Gunning, 2018).
To strengthen PLMs to perform commonsense QA, there is a surging trend of methods equipping language models with different levels of external knowledge, encompassing structured knowledge such as knowledge graphs (KG) (Lin et al., 2019; Yan et al., 2021; Yasunaga et al., 2021; Zhang et al., 2022b) and unstructured knowledge such as text corpora (Lin et al., 2021; Yu et al., 2022). While the KG-based methods have recently yielded remarkable performance on commonsense QA, they are better suited to multiple-choice settings because they emphasize discovering connected patterns between the question and candidate answers. For example, to answer the question "crabs live in what sort of environment?" with candidate answers "saltwater", "galapagos" and "fish market", the KG-based methods manage to capture the path crab-sea-saltwater in the KG, leading to a correct prediction. Nonetheless, they encounter a bottleneck when dealing with commonsense fact verification. Figure 1 shows an example: when asked whether "july always happens in the summer around the world", the KG-based methods tend to detect a strong link between july and summer, which may persuade the model to deliver the wrong prediction.
In general, there are two major limitations in previous studies. On one hand, structured knowledge abounds with structural information among the entities but suffers from sparsity and limited coverage. On the other hand, unstructured knowledge provides rich and broad context-aware information but suffers from noise. These two kinds of knowledge can be naturally complementary to each other. However, most existing works focus on either structured or unstructured external knowledge but fail to exploit the benefits of heterogeneous knowledge simultaneously. As the example in Figure 1 shows, if we rely only on the structured knowledge in the KG, we tend to derive that july and summer are strongly correlated, with an extremely weak relationship between summer and winter. Similarly, if we focus only on the textual facts, we are more inclined to focus on the fact in grey, as it describes more information about summer in july. As a consequence, uncovering latent relationships among heterogeneous knowledge helps bridge the gap and yield more valuable and useful information.
Motivated by the above ideas, we propose DECKER, a commonsense fact verifier that bridges heterogeneous knowledge and performs a double check based on interactions between structured and unstructured knowledge. DECKER works in the following steps: (i) first, it retrieves heterogeneous knowledge including a KG subgraph and several relevant facts following prior works (Zhang et al., 2022b; Izacard et al., 2022); (ii) second, it constructs an integral graph with the encoded question and facts and then employs relational graph convolutional networks (R-GCN) to reason and filter over the heterogeneous knowledge; (iii) lastly, it adopts a multi-head attention pooling mechanism to obtain a final refinement of the enriched knowledge representation and combines it with the question representation for downstream tasks.
Our contributions are summarized as follows: (i) For the commonsense fact verification task, we initiate research that simultaneously takes heterogeneous knowledge into account.
(ii) We propose a novel R-GCN-based method that constructs an integral graph, executes a double check between structured and unstructured knowledge, and better uncovers the latent relationships between them.
(iii) Experimental results on two commonsense fact verification benchmarks show the effectiveness of our approach, verifying the necessity and benefits of heterogeneous knowledge integration.

Commonsense QA
Commonsense QA is a long-standing challenge in natural language processing as it calls for intuitive reasoning about real-world events and situations (Davis and Marcus, 2015). As a result, recent years have witnessed a plethora of research on developing commonsense QA tasks, including SWAG (Zellers et al., 2018), Cosmos QA (Huang et al., 2019), HellaSwag (Zellers et al., 2019), CSQA (Talmor et al., 2019), SocialIQa (Sap et al., 2019) and PIQA (Bisk et al., 2020). However, these tasks primarily attend to multiple-choice settings, so that there usually exist potential reasoning paths which explicitly connect the question with candidate answers. This may cause the models to be susceptible to shortcuts during reasoning (Zhang et al., 2022b). Therefore, a novel branch of commonsense QA, commonsense fact verification, has emerged to further explore the limits of reasoning models, with benchmarks such as CREAK (Onoe et al., 2021) and CSQA2.0 (Talmor et al., 2022). Unlike previous multiple-choice settings, commonsense fact verification requires models to have richer background knowledge and stronger reasoning abilities based on the question alone. Hence, our work dives into commonsense fact verification and conducts experiments on two typical benchmarks: CREAK and CSQA2.0.

Knowledge-enhanced Methods for Commonsense QA
Despite the impressive performance of PLMs on many commonsense QA tasks, they struggle to capture sufficient external world knowledge about concepts, relations and commonsense (Zhu et al., 2022). Therefore, it is of crucial importance to introduce external knowledge for commonsense QA. Currently, there are two major lines of research based on the property of knowledge: structured knowledge (i.e., knowledge graphs) and unstructured knowledge (i.e., text corpora). The first research line strives to capitalize on distinct forms of knowledge graphs (KG), such as Freebase (Bollacker et al., 2008), Wikidata (Vrandečić and Krötzsch, 2014), ConceptNet (Speer et al., 2017), ASCENT (Nguyen et al., 2021) and ASER (Zhang et al., 2022a). Commonsense knowledge is thus explicitly delivered in a triplet form with relationships between entities. An initial thread of works endeavors to discover potential reasoning paths between the question and candidate answers under multiple-choice settings, and has shown remarkable advances in structured reasoning and question answering. For example, KagNet (Lin et al., 2019) utilizes a hierarchical path-based attention mechanism and graph convolutional networks to cope with relational reasoning. MHGRN (Feng et al., 2020) adapts graph neural networks to multi-hop reasoning, while HGN (Yan et al., 2021) conducts edge generation and reweighting to find suitable paths more efficiently. JointLK (Sun et al., 2022) performs joint reasoning between the LM and the GNN and uses a dynamic KG pruning mechanism to seek effective reasoning. Furthermore, other research enhances the interaction between the raw question text and the KG to achieve better performance and robustness. QA-GNN (Yasunaga et al., 2021) designs a relevance scoring to make the interaction more effective, whereas GreaseLM (Zhang et al., 2022b) leverages multiple layers of modality interaction operations to achieve deeper interaction.
Nevertheless, the scope of commonsense knowledge is infinite, far beyond a knowledge graph defined by a particular pattern.
The second research line attempts to make use of unstructured knowledge with either prompting methods (Lal et al., 2022; Qiao et al., 2023) or information retrieval techniques (Lewis et al., 2020a). Maieutic prompting (Jung et al., 2022) infers a tree of explanations through abductive and recursive prompting from generations of large language models (LLMs), which incurs high inference costs due to paywalls imposed by LLM providers. DrFact (Lin et al., 2021) retrieves the related facts step by step through an iterative process of differentiable operations and further enhances the model with an external ranker. Talmor et al. (2020) employ regenerated data to train the model to reliably perform systematic reasoning. RACo (Yu et al., 2022) utilizes a retriever-reader architecture as the backbone and retrieves documents from a large-scale mixed commonsense corpus. Xu et al. (2021) extract descriptions of related concepts as additional input to PLMs. However, these works mainly focus on homogeneous knowledge and reason on top of it, ignoring the need to fuse multiple forms of knowledge. Unlike previous works, our model is dedicated to intuitively modeling the relations between heterogeneous knowledge, bridging the gap between them, and filtering the more treasured knowledge by exploiting their complementary nature, without incurring LLM inference costs.
Besides, there are some works taking heterogeneous knowledge into account to deal with commonsense reasoning. For instance, Lin et al. (2017) mine various types of knowledge (including event narrative knowledge, entity semantic knowledge and sentiment coherent knowledge) and encode them as inference rules with costs to tackle commonsense machine comprehension. Nevertheless, this work is principally based on semantic or sentiment analysis at the sentence level, seeking knowledge enrichment at various levels of granularity. Our approach, however, is more concerned with extending external sources of knowledge and creating connections between heterogeneous knowledge from distinct sources so that they may mutually filter each other.

Methodology
This section presents the details of our proposed approach. Figure 2 gives an overview of its architecture. Our approach, DECKER, consists of three major modules: (i) the Knowledge Retrieval Module, which retrieves heterogeneous knowledge based on the input question; (ii) the Double Check Module, which merges information from structured and unstructured knowledge and makes a double check between them; (iii) the Knowledge Fusion Module, which combines heterogeneous knowledge together to obtain a final representation.

Knowledge Retrieval Module
KG Retriever Given a knowledge graph $G$ and an input question $q$, the goal of the KG Retriever is to retrieve a question-related subgraph $G^q_{sub}$ from $G$. Following previous works (Lin et al., 2019; Yasunaga et al., 2021; Zhang et al., 2022b), we first perform entity linking against $G$ to extract an initial set of nodes $V_{init}$. We then obtain the set of retrieved entities $V_{sub}$ by adding any bridge entities that lie on a 2-hop path between any two linked entities in $V_{init}$. Eventually, the retrieved subgraph $G^q_{sub}$ is formed by retrieving all the edges that join any two nodes in $V_{sub}$.
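The subgraph retrieval described above can be illustrated with a minimal sketch. It assumes the KG is given as a list of (head, relation, tail) triples and that entity linking has already produced $V_{init}$; the function name `retrieve_subgraph` and the toy triples are hypothetical, and treating edges as undirected when searching for bridge entities is an assumption of this sketch rather than something the text specifies.

```python
from itertools import combinations

def retrieve_subgraph(edges, linked_entities):
    """Sketch of 2-hop subgraph retrieval.

    edges: list of (head, relation, tail) triples of the full KG.
    linked_entities: the entity set V_init produced by entity linking.
    """
    # Undirected neighbour lookup, used only to find bridge entities
    # (ignoring edge direction here is an assumption of this sketch).
    neighbours = {}
    for head, _, tail in edges:
        neighbours.setdefault(head, set()).add(tail)
        neighbours.setdefault(tail, set()).add(head)

    v_init = set(linked_entities)
    v_sub = set(v_init)
    # A bridge entity is adjacent to two different linked entities,
    # i.e. it sits in the middle of a 2-hop path between them.
    for a, b in combinations(sorted(v_init), 2):
        v_sub |= neighbours.get(a, set()) & neighbours.get(b, set())

    # Keep every edge whose two endpoints are both in V_sub.
    sub_edges = [(h, r, t) for h, r, t in edges if h in v_sub and t in v_sub]
    return v_sub, sub_edges

# Toy KG with made-up triples, for illustration only.
toy_kg = [
    ("july", "related_to", "summer"),
    ("july", "is_a", "month"),
    ("winter", "is_a", "season"),
    ("summer", "is_a", "season"),
    ("season", "related_to", "hemisphere"),
]
nodes, sub_edges = retrieve_subgraph(toy_kg, {"july", "summer", "winter"})
print(sorted(nodes))
print(sub_edges)
```

In this toy case, "season" is added as a bridge entity because it neighbours both "summer" and "winter", while "month" and "hemisphere" are dropped.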
Fact Retriever Given a large corpus of texts containing $K$ facts and an input question $q$, the objective of the fact retriever is to retrieve the top-$k$ facts relevant to $q$. Following Contriever (Izacard et al., 2022), which is an information retrieval model pre-trained using the MoCo contrastive loss (He et al., 2020) and unsupervised data only, we employ a dual-encoder architecture where the question and facts are encoded independently by a BERT base uncased model (Huang et al., 2013; Karpukhin et al., 2020). For each question and fact, we apply average pooling over the outputs of the last layer to obtain its corresponding representation. Then a relevance score between a question and a fact is obtained by computing the dot product between their corresponding representations.
More precisely, given a question $q$ and a fact $f_i \in \{f_1, f_2, \dots, f_K\}$, we encode each of them independently using the same model. The relevance score $r(q, f_i)$ between a question $q$ and a fact $f_i$ is the dot product of their resulting representations:
$$r(q, f_i) = \langle E_\theta(q), E_\theta(f_i) \rangle, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ denotes the dot product operation and $E_\theta$ denotes the model parameterized by $\theta$.
After obtaining the corresponding relevance scores, we select $k$ facts $F = \{f_q^1, f_q^2, \dots, f_q^k\}$ whose relevance scores $r(q, f)$ are the top-$k$ highest among all $K$ facts for each question $q$.
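The scoring and selection steps can be sketched in a few lines. This is a toy illustration, not the Contriever implementation: the token vectors are hand-made stand-ins for last-layer encoder outputs, and the helper names `mean_pool`, `dot` and `retrieve_top_k` are hypothetical.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one sequence representation."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

def dot(u, v):
    """Dot product, used as the relevance score r(q, f)."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(question_vec, fact_vecs, k):
    """Return the indices of the k highest-scoring facts and all scores."""
    scores = [dot(question_vec, f) for f in fact_vecs]
    ranked = sorted(range(len(fact_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Hand-made "token outputs"; a real system would obtain them from the encoder.
question_vec = mean_pool([[1.0, 0.0], [0.0, 1.0]])
fact_vecs = [
    mean_pool([[2.0, 2.0]]),
    mean_pool([[-1.0, 0.0], [-1.0, 0.0]]),
    mean_pool([[1.0, 0.0], [1.0, 2.0]]),
]
top_idx, scores = retrieve_top_k(question_vec, fact_vecs, k=2)
print(top_idx, scores)
```

At the scale of millions of facts an exhaustive scan like this would normally be replaced by an approximate nearest-neighbour index; the sketch only shows the scoring rule.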

Double Check Module
Language Encoding Given a question $q$ and a set of retrieved facts $F = \{f_q^1, f_q^2, \dots, f_q^k\}$, we feed their corresponding token sequences $Q = \{q_1, q_2, \dots, q_t\}$ and $f_q^i = \{t_i^1, t_i^2, \dots, t_i^{o_i}\}$ into a PLM, where $t$ and $o_i$ are the lengths of the question and the fact sequence $f_q^i$, respectively. We obtain their representations independently by extracting the [CLS] token inserted at the beginning:
$$q_{enc} = \mathrm{PLM}(Q)_{[\mathrm{CLS}]} \in \mathbb{R}^{d}, \qquad f_{enc}^{i} = \mathrm{PLM}(f_q^i)_{[\mathrm{CLS}]} \in \mathbb{R}^{d}, \tag{2}$$
where $d$ denotes the hidden size defined by the PLM.
Graph Construction Figure 3 gives an example of the constructed graph, which we dub the integral graph. Given a question $q$, a subgraph $G^q_{sub}$ extracted from the KG and several retrieved facts $F = \{f_q^1, f_q^2, \dots, f_q^k\}$, we construct an integral graph denoted as $G = (V, E, R)$. Here $V = V_q \cup V_c \cup V_f$ is the set of entity nodes, where $V_q$, $V_c$ and $V_f$ denote the question node (orange in Figure 3), concept nodes (green in Figure 3) and fact nodes (purple in Figure 3), respectively; $E$ is the set of edges that connect nodes in $V$; $R$ is a set of relations representing the type of edges in $E$. In the integral graph, we define four types of edges:
• concept-to-fact edges: $(n_c, r_{c2f}, n_f)$;
• concept-to-concept edges: $(n_c, r_{c2c}, n_c)$;
• question-to-fact edges: $(n_q, r_{q2f}, n_f)$;
• question-to-concept edges: $(n_q, r_{q2c}, n_c)$,
where $n_q \in V_q$, $n_c \in V_c$, $n_f \in V_f$ and $r_{c2f}, r_{c2c}, r_{q2f}, r_{q2c} \in R$.
For question-to-concept and question-to-fact edges, which are bidirectional, we connect the question node with all the other nodes in the integral graph to enhance the information flow between the question and its related heterogeneous knowledge. For concept-to-concept edges, which are directional, we keep the structured knowledge extracted from the KG and do not distinguish the multiple relations inside the subgraph, as our approach mainly concentrates on effective reasoning over heterogeneous knowledge. For concept-to-fact edges, we use string matching and add a bidirectional edge $(n_c, r_{c2f}, n_f)$ between $n_c \in V_c$ and $n_f \in V_f$ with $r_{c2f} \in R$ if the concept $n_c$ appears in the fact $n_f$. For instance, there should exist an edge between the concept "soup" and the fact "soup is primarily a liquid food". In this way, the noisy and peripheral information is filtered whereas the relevant and precious knowledge is intensified.
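The construction rules above can be sketched as follows. This is an illustrative toy, not the authors' code: the function name `build_integral_graph`, the relation labels "q2c", "q2f", "c2c" and "c2f", and the use of case-insensitive substring matching for the concept-to-fact rule are all assumptions (plain substring matching would also link "air" to a fact containing "hair", so a real system would likely match on tokens or lemmas).

```python
def build_integral_graph(question, concepts, kg_edges, facts):
    """Return (nodes, edges); edges are (source, relation, target) triples."""
    q_node = ("q", question)
    nodes = [q_node] + [("c", c) for c in concepts] + [("f", f) for f in facts]
    edges = []

    def add_bidirectional(u, rel, v):
        edges.append((u, rel, v))
        edges.append((v, rel, u))

    # Question node is connected to every concept and every fact.
    for c in concepts:
        add_bidirectional(q_node, "q2c", ("c", c))
    for f in facts:
        add_bidirectional(q_node, "q2f", ("f", f))

    # Concept-to-concept edges keep the KG direction but collapse
    # the original KG relation labels into a single edge type.
    for head, _, tail in kg_edges:
        edges.append((("c", head), "c2c", ("c", tail)))

    # Concept-to-fact edges: added when the concept string occurs in the fact.
    for c in concepts:
        for f in facts:
            if c.lower() in f.lower():
                add_bidirectional(("c", c), "c2f", ("f", f))
    return nodes, edges

nodes, edges = build_integral_graph(
    question="whales can breathe underwater",
    concepts=["whale", "air", "underwater"],
    kg_edges=[("whale", "at_location", "underwater")],  # made-up triple
    facts=["whales are air-breathing mammals who must surface to get the air they need"],
)
c2f_edges = [e for e in edges if e[1] == "c2f"]
print(len(nodes), len(edges), len(c2f_edges))
```

Here "whale" and "air" are linked to the fact, while "underwater" is not, which is the filtering effect the concept-to-fact edges are meant to provide.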
Afterward, we initialize the node embeddings in the integral graph $G$. For the concept nodes, we follow the method of prior work (Feng et al., 2020; Zhang et al., 2022b) and employ pre-trained KG embeddings for the matching nodes, which are introduced in Section 4.2. The pre-trained embeddings then go through a linear transformation to align the dimension:
$$C = C^{emb} W_c + b_c \in \mathbb{R}^{m \times d}, \tag{3}$$
where $C^{emb} \in \mathbb{R}^{m \times d_c}$ stacks the pre-trained embeddings of the concept nodes, $m$ denotes the number of concept nodes in the sub-graph, $d_c$ denotes the hidden size of the pre-trained KG embeddings, and $W_c \in \mathbb{R}^{d_c \times d}$ and $b_c \in \mathbb{R}^{d}$ are the trainable transformation matrix and bias vector, respectively. For the question node and the fact nodes, we inject the corresponding encoded results from the PLM in Equation 2. Consequently, we obtain the initial node embeddings $N^{(0)} \in \mathbb{R}^{(1+k+m) \times d}$ for the integral graph by stacking the encoded question, the $k$ encoded facts and the $m$ transformed concept embeddings:
$$N^{(0)} = [\, q_{enc};\ f_{enc}^{1};\ \dots;\ f_{enc}^{k};\ C \,]. \tag{4}$$
Graph Reasoning As our integral graph $G$ is a multi-relational graph where distinct edge types carry different kinds of information exchange between disparate knowledge, the message-passing process from a source node to a target node should be aware of its relationship, i.e., the relation type of the edge. For example, the concept-to-fact edges help to implement a double check and filtering between concepts and facts, whereas the concept-to-concept edges assist in discovering the structured information. To this end, we adopt a relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018) to perform reasoning on the integral graph.
In each layer of R-GCN, the current node representations $N^{(l)}$ are fed into the layer to perform a round of information propagation between nodes in the graph and yield new representations:
$$N^{(l+1)} = \text{R-GCN}\left(N^{(l)}\right). \tag{5}$$
More precisely, the R-GCN computes node representations $h_i^{(l+1)} \in N^{(l+1)}$ for each node $n_i \in V$ by accumulating and inducing features from neighbors via message passing:
$$h_i^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right), \tag{6}$$
where $R$ is the set of relations, which corresponds to the four edge types in our integral graph. $N_i^r$ denotes the set of neighbors of node $n_i$ that are connected to $n_i$ under relation $r$, and $c_{i,r}$ is a normalization constant. $W_r^{(l)}$ and $W_0^{(l)}$ are trainable parameter matrices of layer $l$. $\sigma$ is an activation function, which in our implementation is GELU (Hendrycks and Gimpel, 2016).
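Finally, we obtain the graph output $N^{(L)}$ through an $L$-layer R-GCN, i.e., by applying the layer update $L$ times to the initial node embeddings:
$$N^{(0)} \rightarrow N^{(1)} \rightarrow \dots \rightarrow N^{(L)}. \tag{7}$$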
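To make the relation-aware message passing concrete, here is a minimal pure-Python sketch of one R-GCN layer in the spirit of Schlichtkrull et al. (2018). It is not the authors' implementation: the function names are hypothetical, the normalization constant is assumed to be $c_{i,r} = |N_i^r|$ (one common choice), and the weights are toy identity matrices rather than trained parameters.

```python
import math

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def rgcn_layer(h, edges, W_rel, W_self):
    """One R-GCN layer.

    h: list of node vectors; edges: (source, relation, target) index triples;
    W_rel: dict mapping relation name -> weight matrix; W_self: self-loop matrix.
    """
    out = []
    for i in range(len(h)):
        total = matvec(W_self, h[i])  # self-connection term W_0 h_i
        for rel, W_r in W_rel.items():
            # Neighbours that send a message to node i under this relation.
            nbrs = [s for s, r, t in edges if r == rel and t == i]
            for j in nbrs:
                msg = matvec(W_r, h[j])
                # Assumed normalisation: c_{i,r} = |N_i^r|.
                total = [a + m / len(nbrs) for a, m in zip(total, msg)]
        out.append([gelu(x) for x in total])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy node embeddings
toy_edges = [(1, "c2f", 0), (2, "c2f", 0), (0, "q2f", 1)]
out = rgcn_layer(h0, toy_edges, {"c2f": identity, "q2f": identity}, identity)
print(out)
```

Stacking the layer (for example `h = rgcn_layer(h, ...)` repeated $L$ times, with separate weights per layer) gives the $L$-layer output; using one weight matrix per relation is what lets each edge type contribute differently.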

Knowledge Fusion Module
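Multi-head Attention Pooling Since the acquired heterogeneous knowledge is leveraged to help answer the question, further interaction between the question and the knowledge is needed to refine the double-checked knowledge. Following the idea of Zhang et al. (2022b), we introduce a multi-head attention pooling mechanism (MHA) to further gather the question-related information:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V,$$
$$\mathrm{head}_i = \mathrm{Attn}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right),$$
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},$$
where $W_i^{Q} \in \mathbb{R}^{d \times d_q}$, $W_i^{K} \in \mathbb{R}^{d \times d_k}$, $W_i^{V} \in \mathbb{R}^{d \times d_v}$ and $W^{O} \in \mathbb{R}^{h d_v \times d}$ are trainable parameter matrices, and $h$ is the number of attention heads. $d_q$, $d_k$, $d_v$ denote the hidden sizes of the query vector, key vector and value vector, respectively. Specifically, we employ the initial question embedding from the PLM as the query and feed it into MHA together with the graph-encoded representations of facts and concepts. We thus derive the pooled knowledge representation:
$$K_a = \mathrm{MHA}\left(q_{enc}, N', N'\right),$$
where $N'$ denotes the graph-encoded representations of the fact and concept nodes.
Answer Prediction In the end, we concatenate the initial question embedding $q_{enc}$, the pooled knowledge representation $K_a$ and the enriched question representation $q_{enc}^{(L)}$ and deliver the result into a predictor to get a final answer prediction:
$$p = \mathrm{MLP}\left( [\, q_{enc};\ K_a;\ q_{enc}^{(L)} \,] \right),$$
where the predictor is a two-layer MLP with a tanh activation and layer sizes $(3d, d, n_{label})$, and $n_{label}$ denotes the number of labels, which equals 2 in our commonsense fact verification setting. The model is optimized using the cross-entropy loss.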
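The pooling and prediction steps can be sketched with standard scaled dot-product multi-head attention. The sketch below is a toy under stated assumptions: the helper names (`mha_pool`, `predict`), the tiny dimensions ($d = 2$, two heads with $d_q = d_k = d_v = 1$) and all weight values are made up, and matrices are stored with one row per output dimension, i.e. transposed relative to the right-multiplication convention used in the formulas.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows, one row per output dim) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mha_pool(query, nodes, heads, W_o):
    """Pool node vectors with the question as the only query.

    heads: list of (W_q, W_k, W_v) triples, one per attention head.
    Returns the pooled vector and the attention weights of each head.
    """
    concat, weights_per_head = [], []
    for W_q, W_k, W_v in heads:
        q = matvec(W_q, query)
        keys = [matvec(W_k, n) for n in nodes]
        values = [matvec(W_v, n) for n in nodes]
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        concat += [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        weights_per_head.append(weights)
    return matvec(W_o, concat), weights_per_head

def predict(q_enc, k_a, q_graph, W1, b1, W2, b2):
    """Two-layer MLP with tanh over the concatenation [q_enc; K_a; q_enc^(L)]."""
    x = q_enc + k_a + q_graph  # list concatenation, length 3d
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, x), b1)]
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]

# Toy values only; none of these numbers come from a trained model.
q_enc = [1.0, 1.0]
graph_nodes = [[2.0, 0.0], [0.0, 2.0]]          # fact / concept outputs
head_a = ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]])
head_b = ([[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]])
k_a, attn = mha_pool(q_enc, graph_nodes, [head_a, head_b],
                     W_o=[[1.0, 0.0], [0.0, 1.0]])
logits = predict(q_enc, k_a, [0.5, 0.5],
                 W1=[[0.1] * 6, [-0.1] * 6], b1=[0.0, 0.0],
                 W2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0])
print(attn, logits)
```

The per-head attention weights are also what a node-level attention heatmap would visualize, since each weight says how much a fact or concept contributes to the pooled representation.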
CommonsenseQA2.0 is a commonsense reasoning dataset collected through gamification. It includes 14,343 assertions about everyday commonsense knowledge. We use the original train / dev / test splits from Talmor et al. (2022).
CREAK is a dataset for commonsense reasoning about entity knowledge. It is made up of 13,000 English assertions encompassing 2,700 entities that are either true or false, in addition to a small contrast set. Each assertion is generated by a crowdworker based on a Wikipedia entity, which can be a named entity, a common noun or an abstract concept. We perform our experiments using the train / dev / test / contrast splits from Onoe et al. (2021).

Experimental Setup
Retrieval Corpus We leverage the English Wikipedia dump as the retrieval corpus. For preprocessing Wikipedia pages, we utilize the same method as described in Karpukhin et al. (2020) and Lewis et al. (2020b). We divide each Wikipedia page into separate 100-word paragraphs, amounting to 21,015,324 facts in the end.
The pre-trained KG embeddings for the concept nodes follow prior work (Feng et al., 2020), which consists of four steps: (1) it first converts knowledge triples in the KG into sentences using pre-defined templates for each relation; (2) it then feeds these sentences into a PLM to compute embeddings for each sentence; (3) after that, it extracts all token representations of the entity's mention spans in these sentences; (4) it finally mean-pools over these representations and projects the pooled representation.
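The passage-splitting step used to build the retrieval corpus can be sketched as below. This is a simplified illustration, assuming whitespace tokenization and non-overlapping windows; the function name `split_into_passages` is hypothetical and the referenced preprocessing pipelines may differ in details such as text cleaning.

```python
def split_into_passages(text, words_per_passage=100):
    """Split a page into consecutive, non-overlapping passages of fixed word length."""
    words = text.split()  # whitespace tokenisation is an assumption of this sketch
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

# A synthetic 250-word "page" yields two full passages and one shorter remainder.
page = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(page)
print(len(passages), [len(p.split()) for p in passages])
```

Each resulting passage is then treated as one candidate fact for the fact retriever.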
Implementation Details Our model is implemented using PyTorch and based on the Transformers library (Wolf et al., 2020). We fine-tune DeBERTa-V3-Large as the backbone pre-trained language model for DECKER, and the hyperparameter setting generally follows DeBERTa (He et al., 2021). We set the number of R-GCN layers to 3, with a dropout rate of 0.1 applied to each layer. The number of retrieved facts is set to 5 as a trade-off against computation resources. The maximum input sequence length is 256. The initial learning rate is selected from {5e-6, 8e-6, 9e-6, 1e-5} with a warm-up rate of 0.1. The batch size is selected from {8, 16}. We run up to 20 epochs and select the model that achieves the best result on the development set.
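For reference, the reported settings can be collected into a small configuration sketch. The dictionary keys and the helper `candidate_configs` are hypothetical names, and enumerating the full cross-product of learning rates and batch sizes is an assumption; the text only states the sets the values were selected from.

```python
from itertools import product

# Fixed settings reported in the text (key names are illustrative).
FIXED = {
    "backbone": "DeBERTa-V3-Large",
    "rgcn_layers": 3,
    "rgcn_dropout": 0.1,
    "num_retrieved_facts": 5,
    "max_seq_length": 256,
    "warmup_rate": 0.1,
    "max_epochs": 20,
}

# Values that were searched over.
SEARCH = {
    "learning_rate": [5e-6, 8e-6, 9e-6, 1e-5],
    "batch_size": [8, 16],
}

def candidate_configs():
    """Yield one full configuration per combination of searched values."""
    keys = list(SEARCH)
    for values in product(*(SEARCH[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, values))}

configs = list(candidate_configs())
print(len(configs))
```

Each candidate would be trained for up to 20 epochs, keeping the checkpoint with the best development-set accuracy.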

Main Results
Table 1 presents the detailed results on two commonsense fact verification benchmarks: CREAK and CSQA2.0. We compare our model with several baseline methods, which represent distinct knowledge-enhanced methods. UNICORN (Lourie et al., 2021) is instilled with external commonsense knowledge during the pre-training stage. GreaseLM (Zhang et al., 2022b) [...] stage. RACo (Yu et al., 2022) incorporates unstructured knowledge by constructing a commonsense corpus on which its retriever is trained. Besides, we also compare our model with strong PLMs such as T5-3B (Raffel et al., 2022).
The results indicate that our model DECKER outperforms the strong baseline methods and achieves comparable results on the test set of CREAK. Besides, our model surpasses the current state-of-the-art model RACo on the contrast set of CREAK. Moreover, we observe that our model is lightweight and remains competitive without a large number of parameters or mixed training data from multiple tasks, which shows its strength along several dimensions.

Ablation Study
We conduct a series of ablation studies under the same set of hyperparameters to determine the contributions of key components in our model. Results are shown in Table 2.
Graph Construction One of the crucial components of our model is graph construction, where the integral graph contains three types of nodes and four types of edges. We ablate the question node and remove all the edges connected with it. The results show that the removal hurts the performance. Furthermore, we dive into the edge analysis. We first treat all edges as the same type instead of four types, which leads to a significant drop in performance. Our intuition is that effective reasoning among heterogeneous knowledge should attend to edge types because they symbolize the distinct emphases during reasoning. We then remove each kind of edge in turn. Notably, the absence of concept-to-fact edges degrades the performance badly, suggesting the necessity of double-checking between heterogeneous knowledge.

Methods of Pooling
When aggregating the graph output, we analyze the influence of different pooling methods, including max pooling, mean pooling, attention pooling and multi-head attention pooling. These pooling methods can be divided into two categories: those involving and those ignoring the interaction with the question. We compare the models with the same hyperparameters on the development set of CREAK. Results in Table 4 demonstrate that the interaction process improves model performance, which may indicate that graph reasoning mainly handles the information flow between different levels of knowledge, while the additional query with the initial question performs a final refinement of the enriched knowledge. As shown in Table 4, multi-head attention pooling achieves the best performance.
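The difference between the two categories of pooling can be seen in a small sketch. It is illustrative only: the function names and toy vectors are made up, and the single-head dot-product form of attention pooling shown here is an assumption about what "attention pooling" refers to.

```python
import math

def max_pool(nodes):
    """Element-wise maximum over node vectors (ignores the question)."""
    return [max(col) for col in zip(*nodes)]

def mean_pool(nodes):
    """Element-wise mean over node vectors (ignores the question)."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]

def attention_pool(question, nodes):
    """Weighted mean whose weights depend on similarity to the question."""
    scale = math.sqrt(len(question))
    scores = [sum(q * n for q, n in zip(question, node)) / scale for node in nodes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * node[i] for w, node in zip(weights, nodes))
            for i in range(len(question))]

toy_nodes = [[4.0, 0.0], [0.0, 2.0]]   # toy graph outputs
toy_question = [1.0, 0.0]              # toy question embedding
print(max_pool(toy_nodes), mean_pool(toy_nodes),
      attention_pool(toy_question, toy_nodes))
```

Max and mean pooling return the same vector whatever the question is, whereas attention pooling shifts weight toward the node that aligns with the question, which is the interaction the comparison above isolates.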

Interpretability: Case Study
In order to further explore the mechanism and get more intuitive explanations of our model, we select a case from CREAK in which the baseline model fails but our model succeeds. In addition, we analyze the node attention weights related to the question induced in the MHA mechanism. Figure 4 shows that DECKER can bridge the reasoning between heterogeneous knowledge, which leads to better filtering of noisy material while retaining the beneficial information. Concretely, given the claim "whales can breathe underwater", our model first extracts relevant structured and unstructured knowledge and then conducts reasoning over them. After reasoning, our model pays close attention to the concepts "breathe", "whale", "air" and "surface" and to the fact "whales are air-breathing mammals who must surface to get the air they need", as shown in the attention heatmap. We can see that our model is capable of manipulating heterogeneous knowledge to answer the questions.

Conclusion
In this work, we propose DECKER, a commonsense fact verification model that bridges heterogeneous knowledge and performs a double check based on the interactions between structured and unstructured knowledge. Our model not only uncovers latent relationships between heterogeneous knowledge but also conducts effective and fine-grained knowledge filtering. Experiments on two commonsense fact verification benchmarks (CSQA2.0 and CREAK) demonstrate the effectiveness of our approach. While most existing works focus on fusing one specific type of knowledge, we open up a novel perspective to bridge the gap between heterogeneous knowledge to gain more comprehensive and enriched knowledge in an intuitive and explicit way.

Limitations
There are three limitations. First, our model requires the retrieval of relevant structured and unstructured knowledge from different knowledge sources, which can be time-consuming. Using cosine similarity over question and fact embeddings can be a bottleneck for the model performance. Second, our model focuses on rich background knowledge but might ignore some inferential knowledge, which can be acquired from other sources such as Atomic. Third, our model might not be applicable to low-resource languages where knowledge graphs are not available.

Figure 2: Overview of our approach, which consists of three components: Knowledge Retrieval Module (left), Double Check Module (middle), and Knowledge Fusion Module (right). Given an input question, the KG retriever and the fact retriever extract a relevant local KG and facts (Knowledge Retrieval Module); then heterogeneous knowledge, including entities in the KG and facts, is enhanced (Double Check Module); finally, heterogeneous knowledge is merged to deduce the final answer prediction (Knowledge Fusion Module).

Figure 3: An example of the constructed integral graph.

Figure 4: An example showing how our model works to achieve the correct answer, in which our baseline fails. Texts in purple denote facts and texts in green denote concepts.

Table 1: Experimental results on the CREAK and CSQA2.0 datasets. The evaluation metric is accuracy (acc).

Table 2: Ablation study of our model for components in Knowledge Retrieval and Graph Construction modules on the CREAK development set.

Table 3: Results on the CSQA2.0 and CREAK development sets. The evaluation metric is accuracy (acc).