Pay More Attention to Relation Exploration for Knowledge Base Question Answering

Knowledge base question answering (KBQA) is a challenging task that aims to retrieve correct answers from large-scale knowledge bases. Existing attempts primarily focus on entity representation and final answer reasoning, which results in limited supervision for this task. Moreover, the relations, which empirically determine the reasoning path selection, are not fully considered in recent advancements. In this study, we propose a novel framework, RE-KBQA, that utilizes relations in the knowledge base to enhance entity representation and introduce additional supervision. We explore guidance from relations in three aspects: (1) distinguishing similar entities by employing a variational graph auto-encoder to learn relation importance; (2) exploring extra supervision by predicting relation distributions as soft labels with a multi-task scheme; (3) designing a relation-guided re-ranking algorithm for post-processing. Experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our framework, improving the F1 score by 5.8 points (from 40.5 to 46.3) on CWQ and by 5.7 points (from 62.8 to 68.5) on WebQSP, better than or on par with state-of-the-art methods.


Introduction
Given a question expressed in natural language, knowledge base question answering (KBQA) aims to find the correct answers from a large-scale knowledge base (KB), such as Freebase (Bollacker et al., 2008), Wikidata (Vrandečić and Krötzsch, 2014), or DBpedia (Auer et al., 2007). For example, the question "Who is Emma Stone's father?" can be answered by the fact "(Jeff Stone, person.parents, Emma Stone)". The deployment of KBQA can significantly enhance a system's knowledge, improving performance for applications such as dialogue systems and search engines.

Figure 1: An example of the KBQA process. The reasoning begins with the red node and passes through similar entities, which are defined as entities that have similar relations, as shown in the upper right box. Besides, key reasoning relations whose tokens ($x_i$) overlap with the given question ($t_i$) are important for reasoning.

Early attempts on KBQA (Min et al., 2013; Zhang et al., 2018; Xu et al., 2019) mostly focus on transferring given questions into structured logic forms, which are strictly constrained by the consistent structure of the parsed query and the KB. To overcome the limitation of KB incompleteness, many approaches (Xiong et al., 2019; Deng et al., 2019; Lan et al., 2021) map questions and their related KB entities and relations into embeddings and define the reasoning process as a similarity retrieval problem; these are called IR-based methods. Additionally, some studies (Gao et al., 2022; Liu et al., 2023; Ge et al., 2022) have attempted to learn relation embeddings and then incorporate surrounding relations to represent entities, which successfully reduces the number of parameters needed for the model.
However, most of these works (Han et al., 2021) primarily focus on final answer reasoning and the representation of entities, while few explore the full utilization of relations in the KB. Additionally, for answer reasoning, the supervision signal comes only from entities, while we believe that relations also play an important role in determining the reasoning path and choosing the answer.
We propose a new framework, called Relation-Enhanced KBQA (RE-KBQA), to investigate the potential use of relations in KBQA by utilizing an embedding-fused framework. The proposed framework studies the role of relations in KBQA in the following three aspects.
Relations for entity representation. We find that similar entities with similar surrounding relations (e.g., the three green circles in the upper right of Figure 1) play an important role in reasoning. To distinguish them, we introduce QA-VGAE, a question-answering-oriented variational graph auto-encoder, which learns relation weights through global structure features and represents entities by integrating surrounding relations.
Relations for extra supervision. Multi-hop reasoning is often hindered by weak supervision, as models can only receive feedback from final answers (He et al., 2021). To overcome this limitation, we propose a multi-task scheme that predicts the relation distribution of the final answers as additional guidance, using the same reasoning architecture and mostly shared parameters. As illustrated in Figure 1, the proposed scheme requires the prediction of both the answer "Haitian Creole" and its surrounding relation distribution.
Relations for post-processing. We propose a stem-extraction re-ranking (SERR) algorithm to modify the confidence of candidates, motivated by the fact that relations parsed from given questions are empirically associated with strong reasoning paths. As depicted at the bottom of Figure 1, relations that overlap with a given question are marked as key reasoning relations, and their confidence is increased empirically. This allows for re-ranking and correction of the final answers.
In general, our contributions can be summarized as follows. (1) We propose a novel method named Relation-Enhanced KBQA (RE-KBQA) by first presenting QA-VGAE for enhanced relation embedding. (2) We are the first to devise a multi-task scheme to implicitly exploit more supervision signals. (3) We design a simple yet effective post-processing algorithm to correct the final answers, which can be applied to any IR-based method. (4) Lastly, we conduct extensive experiments on two challenging benchmarks, WebQSP and CWQ, to show the superiority of our RE-KBQA over other competitive methods. Our code and datasets are publicly available on GitHub.
Multi-task Learning for KBQA. Multi-task learning can boost the generalization capability on a primary task by learning additional auxiliary tasks (Liu et al., 2019) and sharing the learned parameters among tasks (Hwang et al., 2021; Xu et al., 2021). Many recent works have shown impressive results with the help of multi-task learning in weakly supervised tasks such as visual question answering (Liang et al., 2020; Rajani and Mooney, 2018), sequence labeling (Rei, 2017; Yu et al., 2021), text classification (Liu et al., 2017; Yu et al., 2019) and semantic parsing (Hershcovich et al., 2018). In KBQA, auxiliary information is often introduced in the form of artificial "tasks" relying on the same data as the main task (Hershcovich et al., 2018; Ansari et al., 2019; Gu et al., 2021), rather than independent tasks. This assists the reasoning process and proves to be more effective for the main task. To the best of our knowledge, we are the first to propose a multi-task scheme to assist KBQA that shares most parameters among tasks, for a balance of effectiveness and efficiency.

Problem Formulation
Knowledge Base (KB). A knowledge base usually consists of a huge number of triples: $G = \{\langle e, r, e' \rangle \mid (e, e') \in \xi, r \in R\}$, where $\langle e, r, e' \rangle$ denotes a triple with head entity $e$, relation $r$ and tail entity $e'$. $\xi$ and $R$ denote the sets of all entities and relations, respectively. To apply the triples to downstream tasks, the entities and relations should first be embedded as $d$-dimensional vectors.
Knowledge Base Question Answering (KBQA). Our dataset is formed as question-answer pairs. Let $Q$ represent the set of given questions, where each question $q$ is composed of separated tokens: $Q = \{q \in Q \mid q = x_1, x_2, \ldots, x_n\}$. Let $A$ ($\subseteq \xi$) represent the correct answers of $Q$. Thus, the dataset is formulated as $D = \{(Q, A) \mid (q_1, a_1), (q_2, a_2), \ldots, (q_m, a_m)\}$. To reduce the complexity of the reasoning process, we extract question-related head entities $e_h$ from $q$ and generate an associated subgraph $g_{sub}$ ($\in G_{sub}$) within multi-hop walking from $e_h$. Thus, the goal of KBQA is transformed into reasoning the candidates $c$ ($\subseteq \xi$) of the highest confidence from $g_{sub}$, carried out by a representation network $f_\theta(\cdot)$ and a reasoning network $r_\phi(\cdot)$.
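The subgraph generation step described above can be illustrated with a small sketch. This is only an assumption-laden toy: the function names `build_index` and `extract_subgraph` are hypothetical, the walk follows outgoing edges only, and the paper's actual preprocessing (see the experimental setting) may differ.

```python
from collections import defaultdict


def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) pairs."""
    index = defaultdict(list)
    for head, rel, tail in triples:
        index[head].append((rel, tail))
    return index


def extract_subgraph(triples, topic_entity, max_hops):
    """Collect the triples reachable from topic_entity within max_hops.

    Assumption: edges are followed from head to tail only.
    """
    index = build_index(triples)
    frontier = {topic_entity}
    seen = {topic_entity}
    subgraph = []
    for _ in range(max_hops):
        next_frontier = set()
        for head in frontier:
            for rel, tail in index[head]:
                subgraph.append((head, rel, tail))
                if tail not in seen:
                    seen.add(tail)
                    next_frontier.add(tail)
        frontier = next_frontier
    return subgraph


# Toy KB with placeholder entity and relation names.
toy_kb = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D"), ("X", "r1", "B")]
two_hop = extract_subgraph(toy_kb, "A", max_hops=2)
```

Here `two_hop` contains the triples starting at "A" and "B" only; the triple starting at "C" would require a third hop, and the one starting at "X" is never reached.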

Our Approach
As discussed in Section 1, we consider three aspects to further boost the performance of KBQA, including (i) the enhancement of the representation capability, especially for similar entities; (ii) a strategy of mining more supervision signals to guide the training; and (iii) a reasoning path correction algorithm to adjust the ranking results. Below, we elaborate on our network architecture (RE-KBQA) and our solutions to the above issues.

Architecture Overview
Inspired by the neighborhood aggregation strategy, we employ the Neural State Machine (NSM) (He et al., 2021) as our backbone model, where entities are denoted by surrounding relations. We assume that the topic entities and the related subgraph are already obtained by preprocessing; see Section 5.2 for the details. Figure 2 shows the main pipeline of our RE-KBQA. Specifically, given a question $q$, we first employ a question embedding module to encode it into a semantic vector.
Here, for a fair comparison with the NSM baseline, we follow He et al. (2021) and adopt GloVe (Pennington et al., 2014) to encode $q$ into embeddings $\{V_q^j\}_{j=1}^{n} = \mathrm{GloVe}(x_1, x_2, \ldots, x_n)$, which are then mapped to hidden states by an LSTM (Equation 2), where we set $h'$ as the last hidden state of the LSTM to denote the question vector and $\{h_j\}_{j=1}^{n}$ denotes the token vectors. After obtaining $h'$ and $\{h_j\}_{j=1}^{n}$, the semantic vector $s^{(t)}$ at the $t$-th reasoning step of question $q$ is computed with a multi-layer perceptron $\psi(\cdot)$ and a score function $p(\cdot)$, where $s^{(0)}$ ($\in \mathbb{R}^{|d|}$) is initialized randomly. Next, a QA-VGAE enhanced representation module is designed to represent KB elements under the guidance of $s^{(t)}$. Then, unlike previous works that directly predict the final answer via a score function, we introduce a multi-task learning-fused reasoning module to further predict an auxiliary signal (i.e., the relation distribution). Note that, though we adopt the NSM framework to conduct the KBQA task, we concentrate on enhancing the representation capability by identifying similar entities, as well as on multi-task learning via supervision signal mining. Finally, to avoid ignoring strong reasoning paths, we further propose a stem-extraction re-ranking algorithm to post-process the predictions of our network. Below, we present the details of our three proposed modules.

QA-VGAE Enhanced Representation
Similar entities are defined as entities that are connected mostly by the same edges, with only a small portion of edges being different. For example, as shown in Figure 1, the three nodes marked by dashed circles share almost the same edges, and only the node "Haiti" holds the relation "Person.Spoken_language", which is quite important for answering the question. Hence, distinguishing similar entities and identifying key reasoning paths are essential for embedding-fused information retrieval-based methods. Traditional methods like TransE (Bordes et al., 2013) can grasp local information from independent triples within a KB, but fail to capture the inter-relations between adjacent triple facts. Consequently, they tend to have difficulties in distinguishing similar entities.
To alleviate the above problem, we introduce the Question Answering-oriented Variational Graph Auto-Encoder (QA-VGAE) module, shown in Figure 3, which assigns different weights to reasoning relations, where the weights are learned by VGAE (Kipf and Welling, 2016). Note that, compared with traditional methods like TransE (Bordes et al., 2013), TransR (Lin et al., 2015), and ComplEx (Trouillon et al., 2016), VGAE achieves superior performance in the link prediction task. We thus adopt VGAE in our module to learn weights.
The key insight of this module is to fully learn global structure features by executing a graph reconstruction task and constraining the representation to a normal distribution, thus making the relation representation more discriminative. Finally, by similarity evaluation of the learned representation, we obtain the prior probability of relation (PPR) matrix, whose elements denote the conditional probability of relations.
In detail, we first transfer the KB from $\langle e, r, e' \rangle$ (entity-oriented) to $\langle r, e, r' \rangle$ (relation-oriented). In this way, we can then learn the PPR matrix via a link prediction task with unsupervised learning. Specifically, given the connection degrees $X$ ($\in \mathbb{R}^{|n_r| \times |n_r|}$) of the relations and the adjacency $A$ ($\in \mathbb{R}^{|n_r| \times |n_r|}$) between relation nodes, where $n_r$ denotes the number of relations, we adopt a two-layer GCN to learn the mean $\mu$ and variance $\sigma$ of the relation importance distribution, and further compound the relation representation $Z$ from them with a compound function $\oplus$ (Equation 5). Then, the PPR matrix $P_r \in \mathbb{R}^{|n_r| \times |n_r|}$ is obtained by distribution similarity evaluation. Please refer to Appendix A.1 for the loss function $L_P$ of QA-VGAE. Next, we denote KB elements as $d$-dimensional vectors, with $V_\xi$ ($\in \mathbb{R}^{|n_e| \times |d|}$) as entity vectors and $V_R$ ($\in \mathbb{R}^{|n_r| \times |d|}$) as relation vectors, where $n_e$ is the number of entities. The candidate vectors $V_C$ are then built with $W_C \in \mathbb{R}^{|n_c| \times |n_r|}$, the surrounding relation matrix of entities, where $n_c$ denotes the number of candidates.
Then, to integrate the semantic vector $s^{(t)}$ of the given question and the history vector, we update the candidate vector $V_c^{(t)}$ ($\in V_C$) at time step $t$, using a linear layer $\sigma(\cdot)$, the concatenation operation $[;]$, element-wise multiplication $\odot$, and a learnable parameter matrix $W_r$ ($\in \mathbb{R}^{|d|}$).
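The entity-oriented to relation-oriented transformation described in this subsection can be sketched as follows. This is a minimal illustration under one stated assumption: two relations $r$ and $r'$ are linked through an entity $e$ when $r$ arrives at $e$ and $r'$ leaves it. The source does not spell out the exact linking rule, and the function name `to_relation_triples` is hypothetical.

```python
from collections import defaultdict


def to_relation_triples(triples):
    """Turn <e, r, e'> triples into <r, e, r'> triples.

    Assumption: r and r' are connected via entity e when r points
    into e and r' points out of e.
    """
    incoming = defaultdict(set)  # entity -> relations arriving at it
    outgoing = defaultdict(set)  # entity -> relations leaving it
    for head, rel, tail in triples:
        outgoing[head].add(rel)
        incoming[tail].add(rel)

    relation_triples = set()
    for entity, rels_in in incoming.items():
        for r_in in rels_in:
            for r_out in outgoing.get(entity, ()):
                relation_triples.add((r_in, entity, r_out))
    return relation_triples


# Toy KB with placeholder names.
toy_kb = [("q", "r1", "a"), ("a", "r2", "b"), ("a", "r3", "c")]
relation_view = to_relation_triples(toy_kb)
```

In this toy graph, "r1" becomes a neighbor of both "r2" and "r3" through the shared entity "a", which is the kind of relation-to-relation structure a link prediction model can then be trained on.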

Multi-task Learning-Fused Reasoning
The purpose of this module is to conduct answer reasoning from the candidate vectors. To this end, we jointly combine the reasoning paths implicitly among candidates by utilizing the Transformer (Vaswani et al., 2017). Specifically, motivated by weakly-supervised learning methods, we assume the reasoning process starts from the topic entity's surrounding relations $S_R^{(0)}$ (initialized along with subgraph generation), and during reasoning, we can easily obtain the distribution of the next surrounding relations, where $S_R^{(t)}$ denotes the surrounding relations of candidates at step $t$ and $V_R^{\top}$ is the transpose of $V_R$ at step $t$. Note that introducing the multi-task scheme does not noticeably increase the complexity of our method, since the number of relations is far smaller than that of entities in most cases, and the auxiliary task shares most parameters with the main task.
In this way, there are two optimization goals of the KBQA task, i.e., correct answer retrieval and surrounding relations prediction. We predict the final answers' probabilities $p_c^{(t)}$, the confidence of the predicted answers. The relation distribution confidence $p_r^{(t)}$ is predicted likewise, with its own learnable parameters. Then, the answer retrieval loss $L_c$ and the relation prediction loss $L_r$ are calculated with the KL divergence against the ground truths $p_c^{(*)}$ and $p_r^{(*)}$. Thus, the final total loss combines $L_c$ and $L_r$, weighted by a hyper-parameter $\lambda$.
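The two-loss setup above can be made concrete with a short sketch. The KL-divergence losses follow the text; the convex combination $\lambda L_c + (1 - \lambda) L_r$ is an assumption, chosen because the ablation describes a larger $\lambda$ as putting more weight on the main task. The function names and the `eps` smoothing term are hypothetical.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


def total_loss(ans_true, ans_pred, rel_true, rel_pred, lam=0.5):
    """Combine answer loss L_c and relation loss L_r.

    Assumption: L = lam * L_c + (1 - lam) * L_r, so lam = 1.0
    trains on the main answer-retrieval task only.
    """
    loss_c = kl_divergence(ans_true, ans_pred)
    loss_r = kl_divergence(rel_true, rel_pred)
    return lam * loss_c + (1.0 - lam) * loss_r


# Toy distributions over two candidates and three relations.
loss = total_loss([1.0, 0.0], [0.7, 0.3], [0.5, 0.5, 0.0], [0.4, 0.4, 0.2], lam=0.5)
```

Because the relation target is a soft distribution rather than a single label, the auxiliary term gives a training signal even when the top-ranked candidate is wrong.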

Stem-Extraction Re-Ranking
A limitation of embedding-fused KBQA methods is that the reasoning path is uncontrollable, as the complete reasoning path is a black box in information retrieval-based methods. For example, in the question "What is the Milwaukee Brewers mascot?", the strongly related path "education.mascot" may be missed due to limited representation capability. However, this weakness can be easily addressed by semantic parsing-based methods, which analyze the semantic similarity of key elements of questions and relations and constrain the reasoning path. Inspired by this observation, we propose a stem-extraction re-ranking (SERR) algorithm for post-processing. The key idea is to stem-match and re-rank the candidates after obtaining candidates and their confidence from our network.
In detail, we design three operators to execute the re-ranking, as shown in Algorithm 1: stemmer F(•), modifier M(•), and re-ranker R(•). These operators are used to extract stems from relations or given questions, modify candidates' confidence, and then re-rank the candidates. As shown in Figure 4, given a question and candidate predictions, we first use F(•) to process all the Freebase relations and the questions. Then, we generate a relation candidate pool by matching the stem pool of the question with the relation stems. This allows us to compare the subgraph of the given question with pseudo-facts produced by the given topic entities and candidates, respectively. Finally, according to the comparison, M(•) and R(•) are employed to conduct the re-ranking process.
It is worth noting that, in our work, we directly use a stem extraction method rather than similarity calculation to re-rank. The insight behind this choice is that it is unnecessary to consider semantic features again, since we have already injected the question's semantic information into our encoded semantic vector $s^{(t)}$, which means that the model is already equipped with semantic clustering capability. Moreover, stem extraction costs fewer computation resources, as shown in Appendix A.2. Also, our SERR can be migrated to other models as an independent plug-in module.
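The stem-match-then-re-rank idea can be sketched in a few lines. This is a toy illustration, not the paper's Algorithm 1: the suffix-stripping stemmer, the fixed `boost` value, the candidate names, and the relation "location.hometown" are all hypothetical stand-ins, and the real method matches against subgraph facts rather than a single relation per candidate.

```python
import re

# Hypothetical toy stemming rules; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stems_of(text):
    """Lowercase, split on non-alphanumerics, and stem every token."""
    return {stem(tok) for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def serr(question, candidates, boost=0.5):
    """Re-rank (answer, confidence, relation) candidates.

    A candidate's confidence is raised by `boost` when its relation
    shares at least one stem with the question (assumed rule).
    """
    question_stems = stems_of(question)
    rescored = []
    for answer, confidence, relation in candidates:
        if question_stems & stems_of(relation):
            confidence += boost
        rescored.append((answer, confidence))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


ranked = serr(
    "What is the Milwaukee Brewers mascot?",
    [
        ("candidate_city", 0.6, "location.hometown"),
        ("candidate_mascot", 0.4, "education.mascot"),
    ],
)
```

In this example the second candidate moves to the top because the stem "mascot" appears in both the question and its relation, even though the network originally gave it lower confidence.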

Datasets
We conduct experiments on two popular benchmark datasets, WebQuestionSP (Yih et al., 2015) and ComplexWebQuestions (Talmor and Berant, 2018). Specifically, WebQuestionSP (abbr. WebQSP) is composed of simple questions that can be answered within two-hop reasoning and is constructed based on Freebase (Bollacker et al., 2008). In contrast, ComplexWebQuestions (abbr. CWQ) is larger and more complicated, where the answers require multi-hop reasoning over several KB facts. The detailed statistics of the two datasets are summarized in Table 1.

Experimental Setting
Basic setting. To make a fair comparison with other methods, we follow existing works (Sun et al., 2018, 2019; He et al., 2021) to process the datasets, including candidate generation by the PageRank-Nibble algorithm and subgraph construction within three hops by retrieving from topic entities. We set the learning rate to 8e−4 and decay it linearly throughout iterations on both datasets. We set the number of training epochs on WebQSP and CWQ to 200 and 100, respectively. For better reproducibility, we give all the parameter settings in Appendix A.3.
Baselines. We compare our method with multiple representative methods, including semantic parsing (SP)-based methods and information retrieval (IR)-based methods. SPARQA (Sun et al., 2020) and QGG (Lan and Jiang, 2020) belong to the former category, which focuses on generating optimal query structures. Besides, KV-Mem (Miller et al., 2016), EmbedKGQA (Saxena et al., 2020), GraftNet (Sun et al., 2018), PullNet (Sun et al., 2019), ReTraCk (Chen et al., 2021) and BiNSM (He et al., 2021) are all IR-based methods, which are also the focus of our comparison.
Evaluation metrics. To fully evaluate KBQA performance, we should compare both the retrieved and the ranked candidates with the correct answers. To this end, we employ the commonly used F1 score and Hit@1. The F1 score measures whether the retrieved candidates are correct, while Hit@1 evaluates whether the ranked candidate of the highest confidence is in the answer set.
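For concreteness, the two metrics can be computed per question as in the sketch below. The function names are hypothetical, and averaging the per-question scores over a test set is the usual convention rather than something stated in the text.

```python
def f1_score(predicted, gold):
    """Per-question F1 between the retrieved set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def hit_at_1(ranked, gold):
    """1.0 if the top-ranked candidate is a gold answer, else 0.0."""
    return 1.0 if ranked and ranked[0] in set(gold) else 0.0


# Answer names borrowed from the paper's case study, used here only as labels.
gold_answers = ["Juice", "Above the Rim"]
retrieved = ["Juice", "Nothing but Trouble"]
question_f1 = f1_score(retrieved, gold_answers)
question_hit = hit_at_1(retrieved, gold_answers)
```

With one of two retrieved candidates correct and one of two gold answers found, precision and recall are both 0.5, so the F1 is 0.5, while Hit@1 is 1.0 because the top candidate is a gold answer.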

Comparison with Others
We first compare our RE-KBQA against the aforementioned baselines on the two datasets; the results over both evaluation metrics are reported in Table 2. Particularly, compared with the results produced by RE-KBQA$_b$, our full method improves more on the CWQ dataset, with increases of 3.5 and 5.8 in terms of Hit@1 and F1, showing that our contributions can indeed boost the multi-hop reasoning process. Besides, RE-KBQA also obtains good results on simple questions (i.e., the WebQSP dataset), especially a 5.7 increase in F1 score, which reveals that the model can recall more effective candidates.
As shown in Table 2, the SP-based methods (i.e., SPARQA and QGG) perform well on WebQSP but worse on complicated questions, which reveals that SP-based methods are still weak in multi-hop reasoning. Similarly, traditional embedding methods, i.e., KV-Mem, EmbedKGQA, and GraftNet, also perform better on simple questions than on complex ones. Though PullNet and BiNSM show good multi-hop reasoning capacity, the extra corpora analysis and bi-directional reasoning mechanism inevitably increase the complexity of these networks.
Apart from the above methods, some recent attempts utilize additional resources for task enhancement. As listed for reference in Table 2, CBR-KBQA relies on expensive large-scale extra human annotations and the RoBERTa pre-trained model (PLM), Unik-QA retrieves one hundred extra context passages for relations in the KB and uses T5-base (PLM), and KQA-Pro uses a large-scale dataset for pre-training with the help of explicit reasoning path annotation. While these methods achieve promising performance, their expensive human annotation costs and model efficiency remain a concern.

Network Component Analysis
To evaluate the effectiveness of each major component in our method, we conducted a comprehensive ablation study. In detail, similar to Section 5.3, we remove all three components and denote the backbone network as RE-KBQA$_b$. Then, we add QA-VGAE (Section 4.2), multi-task learning (Section 4.3), and SERR (Section 4.4) back on RE-KBQA$_b$, respectively. In this way, we constructed four network models in total and re-trained each model separately using the same settings as our RE-KBQA model. Table 3 shows the results. By comparing different cases with the bottom-most row (our full pipeline), we can see that each component contributes to improving the performance on both datasets. More ablation experiments can be found in the Appendix. Below, we discuss the effect of each module separately.
Effect of QA-VGAE. From the results in Table 3, we can observe that the improvements from using QA-VGAE are more remarkable than those from the other two modules, demonstrating that QA-VGAE is more helpful in boosting the reasoning process for both simple and complex questions. Besides the quantitative comparison, we also tried to reveal its effect in a visual manner. Here, we adopt T-SNE to visualize the relation vectors. Figure 5 shows a typical embedding distribution before and after QA-VGAE training. For a clear visualization, we randomly select some relations related to the case "What is the capital of Austria?". The orange nodes represent relations close to "location", such as "location.country.capital", "location.country.first_level_divisions", etc., and the blue nodes denote the relations that are not covered by the question subgraph, which we call far relations. After using QA-VGAE, the related relations (orange nodes in (b)) tend to get closer and the other nodes get farther away.
Effect of multi-task learning. As shown in Table 3, the multi-task learning module shows better performance on simple questions (see the WebQSP dataset), since the relation distribution is denser than the candidate distribution, causing the prediction to become more complicated as the number of reasoning steps increases. To fully explore the effect of this module, we study different loss fusion weights; the results are shown in Figure 6, where a larger λ (ranging from 0.1 to 1.0; we discard the setting of 0.0 for its bad performance) denotes a more heavily weighted loss of the main task. Clearly, using only the primary task or only the auxiliary task is not optimal for KBQA, and the best setting of λ is 0.1 and 0.5 for the two datasets. An interesting observation is that the best Hit@1 is obtained with a lower λ while the best F1 score is obtained with a higher λ on each dataset. We attribute this to the different goals of the Hit@1 and F1 metrics: Hit@1 shows whether the top candidate is found, while the F1 score evaluates whether most candidates are found.
Effect of SERR. This module is lightweight (see Appendix A.2 for inference time) yet effective, especially for simple questions; see Table 3. Intuitively, the stem extraction for key paths is quite effective for questions that rely on directly connected facts. In contrast, stem extraction for complex questions relies more on the start point and end point. Figure 7(a) further shows an example result of the SERR module, which shows that it can effectively identify closely connected facts of a given question and re-rank the candidates.

Case Study
Finally, we show a case result produced by our RE-KBQA; see Figure 7(b). Given the question "What are the movies that had Tupac in them and which were filmed in New York City?", our method first embeds the question into vectors and retrieves the related subgraph. Then, with the help of our proposed QA-VGAE and multi-task learning, the trained model obtains candidates such as "Murder Was the Case" and "Nothing but Trouble", and thanks to the SERR algorithm, our reasoning process has a chance to re-rank the candidates, thus boosting its performance. Finally, we output "Juice" and "Above the Rim" as the correct answers. For similar entity identification, SERR as a plug-in for other methods, and more case results, please refer to Appendix A.2 and A.5.

Conclusion
In this paper, we proposed a novel framework, namely RE-KBQA, with three novel modules for knowledge base question answering: QA-VGAE to explore the relation promotion for entity representation, multi-task learning to exploit relations for more supervision, and SERR to post-process relations to re-rank candidates. Extensive experiments validate the superior performance of our method compared with state-of-the-art IR-based approaches.

Limitations
While good performance has been achieved, there are still limitations in our work. First, though QA-VGAE extracts enhanced features and is fast to train, it is an independent module from the main framework. Second, as a post-processing step, the SERR module performs better on simple questions than on complex questions.
In the future, we would like to explore the possibility of fusing relation constraints into the representation module directly and injecting a strong-fact identification mechanism as a guidance signal for the multi-hop reasoning process, aiming to integrate QA-VGAE and SERR into the main framework.

Training Goal. We adopt encoder-decoder models to conduct relation reconstruction tasks. Given the prepared adjacency matrix $A$ and feature matrix $X$, we use a two-layer GCN as a distribution learning model to estimate its mean and variance. The training loss function $L_P$ is defined over $Z$, which is calculated by Equation 5, using the Kullback-Leibler divergence (KL), with $q(\cdot)$ and $p(\cdot)$ denoting the encoder and decoder, respectively; please refer to Kipf and Welling (2016) for more details.
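For reference, the standard VGAE objective from Kipf and Welling (2016) has the form below. It is given here as background only: whether $L_P$ in this work uses exactly this form, sign convention, or any reweighting is an assumption, since the original equation is not reproduced in this text.

```latex
% Variational lower bound of VGAE (Kipf and Welling, 2016).
% The bound is maximized; a loss would minimize its negation.
\mathcal{L} = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(A \mid Z)\big]
            - \mathrm{KL}\big[\, q(Z \mid X, A) \,\|\, p(Z) \,\big],
\qquad
q(Z \mid X, A) = \prod_{i} \mathcal{N}\big(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\big),
\qquad
p(A_{ij} = 1 \mid z_i, z_j) = \operatorname{sigmoid}\big(z_i^{\top} z_j\big).
```

The first term rewards reconstructing the relation adjacency $A$ from the latent representation $Z$, and the KL term keeps the encoder distribution close to the prior $p(Z)$, which matches the description of constraining the representation to a normal distribution.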
Settings. Specifically, $A$ is defined as the matrix of neighborhood relations between nodes, where we set $A(i, j)$ to 1 if there is a connection between relations $r_i$ and $r_j$, and to 0 otherwise. $X$ is the feature matrix defined as the connectivity, which is accumulated as the number of edges between two nodes, aiming to show the importance of a relation. We set an empirical threshold on each element of the feature matrix to prevent extremely large values from hurting the model's training (for example, the degree of "Common.type_of" is very large), where $c$ is the connectivity and $\tau$ is an empirical hyper-parameter, which we set to 2000 in our work.
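A minimal sketch of building $A$ and $X$ from relation-oriented triples is given below. Two points are assumptions rather than statements from the text: the threshold is applied as $\min(c, \tau)$, and both matrices are treated as symmetric. The function name `build_relation_matrices` is hypothetical.

```python
def build_relation_matrices(relation_triples, relations, tau=2000):
    """Build adjacency A and clipped connectivity X over relation nodes.

    relation_triples: iterable of (r, e, r') links between relations.
    Assumptions: matrices are symmetric and counts are capped at tau.
    """
    idx = {rel: i for i, rel in enumerate(relations)}
    n = len(relations)
    A = [[0] * n for _ in range(n)]
    X = [[0] * n for _ in range(n)]
    for r, _entity, r2 in relation_triples:
        i, j = idx[r], idx[r2]
        A[i][j] = A[j][i] = 1
        X[i][j] += 1
        if i != j:
            X[j][i] += 1
    # Cap each connectivity count so hub relations do not dominate training.
    X = [[min(count, tau) for count in row] for row in X]
    return A, X


# Toy example with a small cap so the clipping is visible.
links = [("r1", "a", "r2"), ("r1", "b", "r2"), ("r1", "c", "r2"), ("r2", "a", "r3")]
A, X = build_relation_matrices(links, ["r1", "r2", "r3"], tau=2)
```

In the toy example, "r1" and "r2" are linked through three entities, but the cap of 2 limits the corresponding entry of `X`, while `A` only records that the link exists.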

A.2 SERR Algorithm
Complexity Analysis. Admittedly, applying semantic similarity between relations and given questions is a more straightforward way to identify strong relations. However, such a method is more complicated and time-consuming.
To show the efficiency of our method, we conduct a comparison experiment to reflect the complexity of the two methods. As shown in Table 4, the top two rows denote the semantic similarity method, and the last row denotes our method. Our method is more lightweight, requiring neither extra pre-trained models nor GPU resources. For comparison, we adopt the bert-base-uncased model to conduct the semantic similarity process in this experiment, which can be downloaded from https://huggingface.co/bert-base-uncased.
Performance Analysis. Besides, to demonstrate that SERR can be plugged in and infer cases quickly, we further validate its accuracy and inference time, as shown in Table 5. Note that, since SERR relies on traditional stem extraction rather than semantic understanding to identify the key paths, there is no training period for SERR, and it can be applied to any information retrieval (IR)-based method. Finally, to demonstrate the plug-in property of the SERR module, we integrate it into the BiNSM network (He et al., 2021); the results are shown in Table 6. The results show that SERR can indeed increase the Hit@1/F1 score from 74.3/67.4 to 74.8/68.0 on the WebQSP dataset, and from 48.8/44.0 to 49.5/45.3 on the CWQ dataset.

A.3 Hyper-parameter Setting
In order to help reproduce RE-KBQA and its reasoning performance, as shown in Table 7, we list the hyper-parameters of the best results on two benchmark datasets. For the WebQSP dataset, the
Simple questions. As shown in Figure 9, we show a case of one-hop reasoning on the WebQSP dataset, which shows that RE-KBQA performs well in simple question answering, as the main network can recall correct candidates and the SERR module can effectively re-rank the candidates.
Similar entity identification. To demonstrate that our method can indeed distinguish similar entities, we choose a case that requires reasoning across similar entities, as shown in Figure 10(a). While most of the surrounding edges are the same among the candidates of the first step, our method can still select the correct node as the final answer.
RE-KBQA reasoning process. Figure 10(b) shows a three-hop reasoning case of our method, to intuitively demonstrate that our method can effectively conduct a multi-hop reasoning process. Note that the reasoning process of our method can be illustrated as the status transfer of the relation V from one distribution into another, which is not strictly consistent along the reasoning path and thus, to some degree, addresses the problem of knowledge base incompleteness.

Figure 2: Framework of our proposed approach RE-KBQA. Given a question expressed in natural language, we first employ question embedding to encode semantic vectors. Then, we employ the QA-VGAE enhanced representation module to learn candidate vectors $V_c^{(t)}$, aiming to identify similar entities and key reasoning paths while reasoning. Finally, a multi-task learning module is proposed to promote the training procedure.

Figure 3: Illustration of training QA-VGAE, including a total of three steps. We adopt a two-layer GCN as the encoder, formalized as $\mathrm{GCN}_\sigma$ and $\mathrm{GCN}_\mu$.
where $\{V_{c_i}^{(t)}\}_{i=1}^{l}$ denotes all the candidate vectors. However, as in most existing works (Deng et al., 2019; Lange and Riedmiller, 2010), learning from the final answers as the only feedback tends to make the model hard to train, due to the limited supervision. How to introduce extra supervision signals into the network model is still an open question. In our method, we introduce a new auxiliary task that learns the distribution of candidates' surrounding relations, namely surrounding relations reasoning. The key idea is to leverage the relations around the final answer as extra supervision to promote the performance, and also to modify reasoning paths implicitly.

Figure 4: Illustration of the SERR algorithm, where a stem match mechanism is introduced between KB relations and given questions. If key reasoning relations exist, the rank of candidates will be increased.

Algorithm 1: Stem Extraction Re-Ranking
Input: natural language question Q, candidates C, confidence p_C, relation set R.
Output: updated candidates C′ and confidence p′_C.
1: <* Step 1: Build Relation Trie P_s *>
2: ∅ → P_s
3: for all r in R do
4:   index i, stem s = F(r)
5:   P_s.update(⟨i, s⟩)
6: end for
7: for all {q, c, p_c} in {Q, C, p_C} do
…
   <* Step 3: Re-Ranking c and p_c *>
12:   r_c = match(P^e_stem, P_s)
13:   generate P = ⟨e, P_s(r_c), e′⟩

Figure 5: Relation vector visualization in the case of "What is the capital of Austria?" via T-SNE. Orange nodes indicate relations close to "location" and blue nodes indicate far relations.

Figure 6: Analysis of using different loss fusion weights among two benchmark test sets in multi-task learning.
Figure 7: Case analysis of multi-hop reasoning process. (a) Case analysis for SERR with the question "What Chamorro Time Zone countries have territories in Oceania?". To be clear, we show only the top three representative paths here. (b) Case analysis for RE-KBQA with the proposed modules.

Figure 9: An example of one-hop reasoning process produced by our method.
Figure 10: Case analysis of similar entities and the thorough process of RE-KBQA compared with the backbone network. Specifically, (a) demonstrates that our model can reason correct answers across similar entities, benefiting from QA-VGAE, in the case "What is the capital of Austria?". (b) shows the full pipeline of our proposed method in the case "Which man is the leader of the country that uses Libya, Libya, Libya as its national anthem?".

Table 1: Statistics of the WebQSP and CWQ datasets. Note that Entities and Relations denote all the entities and relations covered in the subgraph, respectively.
Algorithm 1 (continued)
14:   generate P′ = ⟨e, P_s(r_c)⟩ ∪ ⟨e′, P_s(r_c)⟩
15:   for all p in P ∪ P′ do
16:     if p in g_sub and p in P then
…
        if p in g_sub and p in P′ then
20:       M(p_c, h_2)
…
      c′ = R(c) and p′_c = R(p_c)
24: end for

Table 2: Performance comparison over state-of-the-art IR-based approaches on the WebQSP and CWQ datasets, where bold fonts denote the best scores, * denotes scores from the original paper and † denotes scores from Zhang et al. (2022a).

Table 3: Comparing our full pipeline (bottom row) with various cases in the ablation study. The cells with different background colors reveal the improvement over our backbone network RE-KBQA$_b$.
knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247-1250.

Table 4: Comparing the SERR module with the semantic similarity method, i.e., cosine distance and Euclidean distance, in terms of model parameters and computing resources. The Time row denotes total handling time (minutes). The Params row denotes model size (MB).

Table 5: Performance of the SERR algorithm in terms of accuracy score and inference time on the two benchmark datasets. The accuracy score is calculated among recalled cases where close facts lie in the subgraph.