Graph Reasoning for Question Answering with Triplet Retrieval

Answering complex questions often requires reasoning over knowledge graphs (KGs). State-of-the-art methods typically use entities in questions to retrieve local subgraphs, which are fed into a KG encoder, e.g., a graph neural network (GNN), to model their local structures, and then integrated into language models for question answering. However, this paradigm constrains retrieved knowledge to local subgraphs and discards more diverse triplets buried in KGs that are disconnected from those subgraphs but useful for question answering. In this paper, we propose a simple yet effective method that first retrieves the most relevant triplets from KGs and then reranks them; the reranked triplets are concatenated with questions and fed into language models. Extensive results on both the CommonsenseQA and OpenbookQA datasets show that our method can outperform the state of the art by up to 4.6% absolute accuracy.


Introduction
Answering complex questions is a challenging task since it often requires world knowledge and reasoning capability from the underlying models (Li et al., 2019; Yasunaga et al., 2021; Zhang et al., 2022). Pre-trained language models, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have shown promising results when fine-tuned on downstream question answering tasks. However, the world knowledge and reasoning of these models are learned from unstructured data, e.g., Wikipedia text, and are still limited (Li et al., 2019; Petroni et al., 2019).
On the other hand, there exist large-scale knowledge graphs (KGs), e.g., Freebase (Bollacker et al., 2008) and ConceptNet (Speer et al., 2016), which capture world knowledge explicitly as triplets recording relations between entities (Zhang et al., 2022). However, how to effectively integrate KGs into language models for question answering is still an open research problem. Li et al. (2019); Yu et al. (2020); Ye et al. (2019); Zhang et al. (2019); Moiseev et al. (2022) focus on utilizing KGs to construct distant supervision signals for continued pre-training; however, KGs are often dynamic in practice, and it is hard to edit knowledge in models without further training, limiting their usage. Bosselut et al. (2019); Wang et al. (2020) linearize reasoning paths in KGs and train language models on them to generate novel knowledge triplets during inference. However, KGs are discarded after training, and language models can hallucinate false world knowledge (Ji et al., 2022). Other methods (Yasunaga et al., 2021; Zhang et al., 2022; Jiang et al., 2022; Wang et al., 2022) instead first recognize entities in questions and link them to KGs to retrieve subgraphs as additional input besides questions. However, this paradigm constrains retrieved knowledge to local subgraphs and discards more diverse triplets buried in KGs that are disconnected but useful for question answering. In addition, these methods require extra KG encoders with parameters trained from scratch besides standard language models, limiting model performance when training data is limited.
Recently, there has been growing interest in converting KGs into lists of passages represented in natural language. Oguz et al. (2022); Ma et al. (2021a) convert KG triplets into texts and combine these KG-converted texts with heterogeneous resources, e.g., tables and unstructured Wikipedia documents, as passages, achieving state-of-the-art performance in open-domain question answering. Li et al. (2022) follow this line of work and utilize KGs and unstructured documents for knowledge-grounded dialogue generation. Zha et al. (2021) linearize reasoning paths in KGs for relation prediction and achieve state-of-the-art performance.
In this paper, we propose to conduct reasoning over KGs for question answering with triplet retrieval. The overall pipeline of the proposed method is shown in Figure 1. Specifically, we first linearize the KG into triplets, convert them into passages via templates, and directly retrieve the most relevant ones given questions with both the sparse BM25 (Robertson et al., 1994) and the Dense Passage Retriever (DPR) (Karpukhin et al., 2020). We then rerank these passages with pre-trained cross-encoders (Reimers and Gurevych, 2019). Finally, the most relevant passages and questions are linearly concatenated and fed into a pre-trained language model for question answering. This paradigm has several advantages compared to recent state-of-the-art methods (Yasunaga et al., 2021; Zhang et al., 2022; Jiang et al., 2022; Wang et al., 2022): (1) it is simple yet effective and can outperform complicated state-of-the-art question answering systems by up to 4.6% absolute accuracy; (2) it does not need extra KG encoders trained from scratch, e.g., GNNs (Scarselli et al., 2009; Veličković et al., 2018), and simply fuses knowledge passages and questions for question answering with standard language models.

Problem setup
We focus on multi-choice question answering (MCQA) tasks requiring model reasoning capability. Specifically, each instance in an MCQA dataset consists of a question q and a candidate choice set C. We also assume access to a knowledge graph G, which provides possibly relevant knowledge for answering each question. Given an example (q, C) and a knowledge graph G, we aim to find the correct answer c⋆ ∈ C.

Knowledge graphs as passage corpus
A knowledge graph G can be represented as a list of triplets P. We convert each triplet p ∈ P into a natural-language passage d via templates so that relevant knowledge can be better retrieved. Specifically, each triplet p ∈ P has a head entity h, a relation r, and a tail entity t. We map r into a natural-language phrase r_p, and d is formed by linearly concatenating <h, r_p, t>. For example, given a triplet <"hair brush", "AtLocation", "hair">, we map "AtLocation" into "is at location of" and form the passage d as "hair brush is at location of hair". Consequently, the knowledge graph G is converted into a passage corpus D.
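As a sketch, the template-based conversion can be implemented as a simple lookup. The `AtLocation` phrase is taken from the example above; the other relation phrases and the function names are illustrative assumptions, and the full mapping used in the paper is deferred to its Appendix A.

```python
# Map KG relation ids to natural-language phrases (small illustrative excerpt;
# only "AtLocation" is taken from the paper's example, the rest are assumed).
RELATION_TEMPLATES = {
    "AtLocation": "is at location of",
    "UsedFor": "is used for",
    "IsA": "is a",
}

def triplet_to_passage(triplet):
    """Convert a KG triplet <h, r, t> into a natural-language passage d."""
    head, relation, tail = triplet
    return f"{head} {RELATION_TEMPLATES[relation]} {tail}"

def kg_to_corpus(triplets):
    """Convert the whole KG (a list of triplets P) into a passage corpus D."""
    return [triplet_to_passage(t) for t in triplets]
```

For instance, `triplet_to_passage(("hair brush", "AtLocation", "hair"))` yields the passage "hair brush is at location of hair" from the example above.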

Hybrid passage retrieval
We retrieve passages from corpus D for an MCQA example (q, C) with hybrid (i.e., both sparse and dense) retrievers, since the two are complementary (Karpukhin et al., 2020; Ma et al., 2021b). For the sparse retriever, we utilize BM25 to index D and use each query (q, c_i) to retrieve N passages. For the dense retriever, we use DPR due to its strong performance in open-domain question answering (Karpukhin et al., 2020). DPR embeds queries with a question encoder and passages with a passage encoder into low-dimensional dense vectors, and retrieval can be done efficiently through the FAISS library (Johnson et al., 2021) on GPUs. As with the sparse retriever, we use (q, c_i) to retrieve N passages from corpus D with DPR. The total number of passages returned by BM25 and DPR is 2N.
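For illustration, the sparse side of the hybrid retriever can be sketched with a from-scratch BM25 scorer over tokenized passages. This is a minimal re-implementation for clarity only; the paper uses the rank-bm25 package, and the k1/b defaults below are common choices but assumed here.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized passage in the corpus against the query with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N  # average passage length
    # Document frequency of each term across the corpus.
    df = Counter()
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def top_n(query_tokens, corpus_tokens, n):
    """Return indices of the N highest-scoring passages for a query (q, c_i)."""
    scores = bm25_scores(query_tokens, corpus_tokens)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

The dense side would analogously embed queries and passages with DPR's two encoders and take the N nearest neighbors via FAISS; the union of the two top-N lists gives the 2N candidates.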

Reranking
We further rerank the 2N passages retrieved by the hybrid retriever with a pre-trained cross-encoder (Reimers and Gurevych, 2019). Specifically, for each passage p among the 2N retrieved passages, we concatenate the query (q, c_i) and the passage p as the input to the pre-trained cross-encoder. For each input, it outputs a scalar value between 0 and 1. These scalar values are then used as reranking scores for the 2N passages.
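A minimal sketch of this reranking step, with the cross-encoder abstracted as an injected scoring function so the snippet stays self-contained. In practice the score would come from a pre-trained cross-encoder such as those distributed with the sentence-transformers package.

```python
def rerank(query, passages, score_fn, top_k):
    """Rerank the 2N retrieved passages and keep the top K.

    score_fn(query, passage) -> scalar in [0, 1]; in the paper this is a
    pre-trained cross-encoder, here it is injected so the sketch is runnable.
    """
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_k]
```

Injecting the scorer keeps the control flow identical whether a heavyweight cross-encoder or a cheap heuristic (e.g., token overlap, as a stand-in) produces the scores.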

Language model reasoning
After reranking, we choose the top K passages P_K and concatenate them with the question q and choice c_i as the input to a pre-trained language model (PLM). For input <q, c_i, P_K>, the PLM outputs a contextual representation vector h_i, which is then fed into a multi-layer perceptron (MLP) to output a scalar value s_i, the prediction score of choice c_i being correct. During training, we calculate the score s_i for each choice c_i ∈ C and normalize the scores with the softmax function. Models are then trained to maximize the scores of correct choices with the standard cross-entropy loss between predictions and ground-truth labels. During inference, we calculate the score s_i for each choice c_i ∈ C and select the one with the highest score as the predicted answer to question q.
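The choice-scoring and training objective described above can be sketched as follows, a plain-Python illustration of the softmax normalization and cross-entropy objective; the scores s_i would come from the PLM plus MLP head.

```python
import math

def softmax(scores):
    """Normalize the per-choice scores s_i into a probability distribution."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(choice_scores):
    """Inference: pick the choice with the highest normalized score."""
    probs = softmax(choice_scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

def cross_entropy_loss(choice_scores, gold_index):
    """Training objective: negative log-probability of the correct choice."""
    return -math.log(softmax(choice_scores)[gold_index])
```

Minimizing the cross-entropy loss pushes the correct choice's normalized score toward 1, which is exactly "maximizing the scores of correct choices" above.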

Experimental setups
We evaluate our method on two question answering datasets requiring model reasoning capability.
(1) CommonsenseQA (Talmor et al., 2019) is a 5-way multi-choice question answering dataset that requires commonsense reasoning. Since its test set is not publicly available, we report results on the in-house split (Lin et al., 2019) for comparison with baselines.
(2) OpenbookQA is a 4-way multi-choice question answering dataset requiring multi-hop reasoning over scientific knowledge (Mihaylov et al., 2018). It has 4,957/500/500 questions in its training/development/test splits, respectively, and we report results on its test set.
Knowledge graph. We utilize ConceptNet (Speer et al., 2016), a multi-relational and multi-lingual general knowledge graph storing world commonsense knowledge. We first extract English triplets, clean them following Yasunaga et al. (2021), and convert them into natural-language sentences as described in section 2.2, resulting in 2,180,391 passages after preprocessing. We defer the details of the relation mapping to Appendix A.
Retrievers. For the sparse retriever, we use the BM25 implementation from the rank-bm25 Python package with default hyperparameters. For the dense retriever, we use the official pre-trained checkpoint from the DPR GitHub repository.
Reranking. We rerank the passages retrieved by BM25 and DPR using pre-trained cross-encoder checkpoints from the sentence-transformers package.
Specifically, for the CommonsenseQA dataset, we use the checkpoint cross-encoder/ms-marco-MiniLM-L-12-v2 pre-trained on the MS MARCO passage-ranking dataset, while for the OpenbookQA dataset, we use the checkpoint cross-encoder/stsb-roberta-large pre-trained on the semantic textual similarity task (Cer et al., 2017).
Language model reasoning. Following Yasunaga et al. (2021); Zhang et al. (2022); Jiang et al. (2022); Wang et al. (2022), we utilize RoBERTa-large (Liu et al., 2019) to reason over passages and questions, although our framework is model-agnostic. Specifically, the question q, choice c_i, and passage list P_K are linearly concatenated with special tokens between them and fed into the model detailed in section 2.5 to predict the choice score.
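As a concrete illustration, the input for each choice can be assembled as below. The RoBERTa-style `<s>`/`</s>` separator tokens are an assumption for illustration only; the actual special tokens depend on the tokenizer.

```python
def build_input(question, choice, passages, bos="<s>", sep="</s>"):
    """Linearly concatenate question q, choice c_i, and the top-K passages P_K
    with separator tokens between them (RoBERTa-style tokens assumed)."""
    parts = [question, choice] + list(passages)
    return f"{bos} " + f" {sep} ".join(parts) + f" {sep}"
```

One such string is built per choice c_i, so an n-way MCQA instance yields n model inputs that are scored independently and then normalized together.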
We defer more implementation and training details of our method into Appendix B.
Table 1: Performance comparison in accuracy (%) on both the CommonsenseQA and OpenbookQA datasets. We report average results over three random seeds along with standard deviations on the IHdev and IHtest splits (Lin et al., 2019) for CommonsenseQA, and test-set performance for OpenbookQA. Best results are in bold and second-best ones are underlined. †: results from Wang et al. (2022). ⋆: results from the original papers. ¶: results from Yasunaga et al. (2022).
Baselines. We compare our method with state-of-the-art baselines both with and without access to an extra KG to show the effectiveness of our method. For all experiments in this work, we use accuracy (%) as the evaluation metric.

Main results
As shown in Table 1, our method consistently outperforms the state of the art on both the CommonsenseQA and OpenbookQA datasets. On CommonsenseQA, our method's test performance outperforms fine-tuned RoBERTa-large without a KG by 6.28% absolute accuracy and the best baseline with a KG, GSC, by 0.49%. On the smaller OpenbookQA dataset, the improvement is larger: our method outperforms fine-tuned RoBERTa-large without a KG by 10.13% absolute accuracy and GSC by 4.60%. Note that a key difference between our method and the RoBERTa-large-without-KG baseline is that our model also takes the additional retrieved passages as input without introducing any extra parameters, yet it outperforms various state-of-the-art methods with extra KG encoders. These consistent results indicate that our simple method can integrate knowledge into language models effectively.

Ablation study
We further ablate our method by removing the BM25 retriever, the DPR retriever, and the reranking module. Note that when removing the reranking module, we use the average of the BM25 and DPR scores if the same passage is retrieved by both; otherwise, following Ma et al. (2021b), if a passage p from BM25 is not in DPR's top N, we use the lowest score in DPR's top N as its DPR score, and vice versa. Ablation results on the IHtest split (Lin et al., 2019) of CommonsenseQA and the test set of OpenbookQA are shown in Table 2. When removing BM25, results on CommonsenseQA and OpenbookQA drop by up to 3.52%. When removing DPR, the result on CommonsenseQA drops slightly, while the result on OpenbookQA drops by 3.20%. These results show that BM25 and DPR are complementary, aligning with Ma et al. (2021b). Similarly, when further removing the reranking module, model performance drops by 1.10% on CommonsenseQA and by 4.00% on OpenbookQA. These consistent results show the effectiveness of reranking the passages retrieved by the hybrid retriever. In addition, the second-best model in our ablation study still achieves strong performance and outperforms the best baseline GSC. Together, these results indicate that our model benefits from the combination of sparse and dense retrievers and from the reranking module, and that it retains strong performance even when some components are removed.
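The score-merging rule used when the reranker is removed can be sketched as follows. This sketch assumes BM25 and DPR scores have been normalized to comparable scales, and the averaging-with-floor-substitution reading of the rule is our interpretation.

```python
def fuse_scores(bm25_top, dpr_top):
    """Merge two top-N score dicts {passage: score} without a reranker.

    If a passage appears in both lists, average its two scores; if it is
    missing from one list, substitute that retriever's lowest top-N score
    before averaging (our reading of the rule from Ma et al., 2021b).
    """
    bm25_floor = min(bm25_top.values())
    dpr_floor = min(dpr_top.values())
    fused = {}
    for p in set(bm25_top) | set(dpr_top):
        b = bm25_top.get(p, bm25_floor)
        d = dpr_top.get(p, dpr_floor)
        fused[p] = (b + d) / 2
    return fused
```

The floor substitution keeps a passage found by only one retriever competitive without letting a missing score default to zero and dominate the ranking.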

Conclusion
In this paper, we propose a simple but effective method for question answering over knowledge graphs with triplet retrieval. Extensive experiments on two datasets show that our method consistently outperforms the state of the art. An ablation study further shows that our model benefits from both the reranking module and the combination of sparse and dense retrievers. We believe that our work can inspire future research on question answering over knowledge graphs.

Limitations
Our work is constrained to multi-choice question answering systems and limited to commonsense reasoning tasks, lacking exploration of other reasoning tasks, e.g., arithmetic reasoning (Cobbe et al., 2021; Chen et al., 2021), conversational reasoning (Chen et al., 2022), and symbolic reasoning (Wei et al., 2022). We leave these directions for future work.

Ethics Statement
Our work utilizes pre-trained language models and an external knowledge graph to build question answering systems. However, pre-trained language models can include biases (Shwartz and Choi, 2020), and knowledge graphs, e.g., ConceptNet, have been found to contain representational harms (Mehrabi et al., 2021), which can cause these question answering systems to inherit such potential biases and harms. Therefore, additional safeguards, e.g., declining inappropriate inputs and filtering harmful outputs, must be in place before real-world deployment.

Figure 1: Overview of our framework. The exemplar KG is from Yasunaga et al. (2021).

Table 2: Results of removing the BM25 retriever, the DPR retriever, and the reranking module on both the CommonsenseQA and OpenbookQA datasets.