Large-Scale Relation Learning for Question Answering over Knowledge Bases with Pre-trained Language Models

The key challenge of question answering over knowledge bases (KBQA) is the inconsistency between natural language questions and reasoning paths in the knowledge base (KB). Recent graph-based KBQA methods are good at grasping the topological structure of the graph but often ignore the textual information carried by the nodes and edges. Meanwhile, pre-trained language models learn massive open-world knowledge from large corpora, but this knowledge is in natural language form and not structured. To bridge the gap between natural language and the structured KB, we propose three relation learning tasks for BERT-based KBQA, including relation extraction, relation matching, and relation reasoning. Through relation-augmented training, the model learns to align natural language expressions to the relations in the KB as well as to reason over the missing connections in the KB. Experiments on WebQSP show that our method consistently outperforms other baselines, especially when the KB is incomplete.


Introduction
Question Answering over Knowledge Base (KBQA) aims to find the answers to a natural language question given a structured knowledge base (KB) and is widely used in modern question answering and information retrieval systems. Traditional retrieval-based KBQA approaches typically build a pipeline system, including named entity recognition, entity linking, subgraph retrieval, and entity scoring. In recent years, with the help of deep representation learning, such approaches have achieved remarkable performance (Dong et al., 2015; Miller et al., 2016; Xu et al., 2016; Sun et al., 2018, 2019; Saxena et al., 2020; He et al., 2021). However, the KBQA task is still challenging, especially for multi-hop questions, for two reasons: 1) Due to the complexity of human language, it is often difficult to align the natural language questions with the reasoning paths in the KB. The model tends to learn by surface matching and easily takes shortcut features (Du et al., 2021) for prediction (shown in Figure 1a). 2) In practice, the KB is often incomplete, which also requires the model to reason over the incomplete graph. But models often fail to do so since they lack explicit training on reasoning (shown in Figure 1b).
Previous works such as GraftNet (Sun et al., 2018) and PullNet (Sun et al., 2019) mainly solve these problems by introducing an external text corpus (e.g., all Wikipedia documents) and using specially designed network architectures to incorporate information from the documents. However, the required external resources may be hard to collect in practice. EmbedKGQA (Saxena et al., 2020) uses pre-trained KB embeddings and trains the question encoder to fit question embeddings into the relation embedding space, such that the KB scoring function can be directly used to rank answers. However, their approach mainly grasps the topological structure of the graph but ignores the textual information in entities and relations, which should also be useful for scoring candidate entities.
In this paper, to learn a better mapping from the natural language questions to the reasoning paths in the KB (Gao et al., 2020;Bouraoui et al., 2020), we reformulate the retrieval-based KBQA task to make it a question-context matching form and propose three auxiliary tasks for relation learning, namely relation extraction (RE), relation matching (RM) and relation reasoning (RR). RE and RM both take advantage of the relation extraction datasets, including WebRED (Ormandi et al., 2021) and FewRel (Han et al., 2018). RE trains the model through inferring relations from the sentences, and RM through determining whether two sentences express the same relation. RR constructs the training data from the KB in a self-supervised manner and trains the model to reason over the missing KB connections given the existing paths.
Our contributions can be summarized as follows: 1) To bridge the gap between natural language and the structured KB, we reformulate the KBQA task as a question-context matching problem and propose auxiliary tasks to enhance the implicit relation learning of pre-trained language models (Devlin et al., 2019). 2) To mitigate the KB's incompleteness issue, we further propose a task for relation reasoning over the KB. 3) Experiments on WebQSP show the effectiveness of our proposed approach, especially when the KB is highly incomplete. Our code is available at https://github.com/yym6472/KBQARelationLearning.

Approach

Problem Definition In this paper, we mainly focus on retrieval-based KBQA. Given an input query q, we first annotate the named entities in the query and link them to nodes in the KB. Then a heuristic retrieval algorithm is applied to obtain a query-specific subgraph G = {⟨e, r, e′⟩ | e, e′ ∈ E, r ∈ R}, where E is the set of all candidate entities that probably contains the answer of q, and R denotes the relation set. Our task is to calculate a score s_i for each candidate entity e_i ∈ E indicating whether e_i is an answer entity or not.
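For concreteness, the following minimal sketch (illustrative only, not the authors' code; the entity names and the score_candidate interface are made up) shows the data this formulation operates on: a subgraph of KB triples and a scoring function over candidate entities.

```python
# A minimal sketch of the problem setup, assuming a toy in-memory subgraph.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # entity e
    relation: str  # relation r
    tail: str      # entity e'

# Query-specific subgraph G retrieved for a question q (fabricated example).
subgraph = [
    Triple("Barack Obama", "people.person.spouse", "Michelle Obama"),
    Triple("Michelle Obama", "people.person.place_of_birth", "Chicago"),
]

def score_candidate(question: str, topic_entity: str, candidate: str) -> float:
    """Return s_i in [0, 1]: the probability that `candidate` answers `question`.
    The model described in "BERT for KBQA" fills in this function."""
    raise NotImplementedError
```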
In this section, we first present how to solve KBQA with BERT, then we introduce three proposed auxiliary tasks to augment the relation learning for BERT.

BERT for KBQA
For each question q, we can obtain its topic entity e_topic from the entity linking system. Then, as shown in Figure 2a, we convert the candidate entity scoring problem into a question-context matching task as follows.
We first find all paths in G that connect the topic entity e_topic and the candidate entity e_i. We set a maximum number of paths and apply downsampling when the number exceeds the threshold. Then we construct the textual form of each path by replacing the nodes with entity names and the edges with relation names from the KB. Finally, we concatenate the question q and all paths p_1, ..., p_n to make an input sample x = [CLS] q [SEP] p_1 [SEP] ... [SEP] p_n [SEP]. Here, we regard these paths as the facts between the topic entity e_topic and the candidate entity e_i. We aim to use BERT to predict whether the hypothesis "e_i is the answer of q" is supported by those KB facts. Thus, we feed the sample to BERT and take the representation h_[CLS] corresponding to the [CLS] token for binary classification: s_i = σ(W h_[CLS] + b), where σ is the sigmoid function, and the model is trained against the ground-truth label y indicating whether e_i is the answer entity of q or not.
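A minimal sketch of this question-context matching step with HuggingFace Transformers follows; the example question and path are fabricated, and the untrained linear head only demonstrates the scoring interface, not meaningful scores.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(bert.config.hidden_size, 1)  # binary classification head

question = "where was michelle obama born"
paths = ["Barack Obama people.person.spouse Michelle Obama "
         "people.person.place_of_birth Chicago"]

# x = [CLS] q [SEP] p_1 [SEP] ... [SEP] p_n [SEP]
text = question + " [SEP] " + " [SEP] ".join(paths)
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    h_cls = bert(**inputs).last_hidden_state[:, 0]   # representation of [CLS]
    score = torch.sigmoid(head(h_cls)).item()        # s_i = sigma(W h_cls + b)
print(f"candidate score: {score:.3f}")
```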

Auxiliary Tasks for Relation Learning
The performance of KBQA depends heavily on the mapping from the natural language questions to the relations in the path. To further enhance the relation learning of BERT, we propose three auxiliary tasks for relation learning, as shown in Figure 2b.
Relation Extraction (RE) One straightforward idea is to use relation extraction datasets, where the model learns to extract the relation expressed in a sentence between the given head and tail entity. Similar to KBQA, we concatenate the sentence and the one-hop path to construct an RE example for BERT: x = [CLS] s [SEP] h r t [SEP], where s, h, r and t denote the sentence, head entity, relation and tail entity, respectively, and the label indicates whether the sentence s expresses the relation r between h and t.
Moreover, to simulate the 2-hop reasoning in KBQA, we also combine two RE examples to make a compositional one: x = [CLS] s_1 s_2 [SEP] h_1 r_1 t_1 [SEP] h_2 r_2 t_2 [SEP], where the tail entity t_1 of the first example is the same as the head entity h_2 of the second example.
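The sketch below illustrates one plausible way to construct such binary RE examples; the field layout and the negative-sampling scheme (corrupting the relation) are our assumptions, not necessarily the paper's exact procedure.

```python
import random

def make_re_example(sentence, head, relation, tail, all_relations):
    """Return a (text, label) pair; negatives swap in a random wrong relation."""
    if random.random() < 0.5:                       # positive example
        return f"{sentence} [SEP] {head} {relation} {tail}", 1
    wrong = random.choice([r for r in all_relations if r != relation])
    return f"{sentence} [SEP] {head} {wrong} {tail}", 0

def make_compositional(ex1, ex2):
    """Chain two RE records whose entities overlap (t_1 == h_2) to mimic 2-hop paths."""
    s1, h1, r1, t1 = ex1
    s2, h2, r2, t2 = ex2
    assert t1 == h2, "tail of the first record must equal head of the second"
    return f"{s1} {s2} [SEP] {h1} {r1} {t1} [SEP] {h2} {r2} {t2}", 1
```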
Relation Matching (RM) In the relation matching task, we assume that two sentences expressing the same relation should have similar representations. Thus, we concatenate two sentences and train BERT to predict whether they express the same relation: x = [CLS] s_1 [SEP] s_2 [SEP], where the label is 1 if s_1 and s_2 express the same relation and 0 otherwise.
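A possible data construction for this task is sketched below; the by_relation index (relation name to sentences expressing it, with at least two sentences per relation) and the 50/50 positive/negative sampling are illustrative assumptions.

```python
import random

def make_rm_example(by_relation):
    """Sample a sentence pair; label is 1 iff both sentences share a relation."""
    r1, r2 = random.sample(list(by_relation), 2)
    if random.random() < 0.5:                       # positive pair: same relation
        s1, s2 = random.sample(by_relation[r1], 2)
        return f"{s1} [SEP] {s2}", 1
    s1 = random.choice(by_relation[r1])             # negative pair: different relations
    s2 = random.choice(by_relation[r2])
    return f"{s1} [SEP] {s2}", 0
```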
Relation Reasoning (RR) BERTRL (Zha et al., 2021) proposes a self-supervised approach for the KB completion task. It chooses one triplet (h, r, t) from the KB and assumes it is missing. It then finds other multi-hop paths from h to t and uses them to predict whether (h, r, t) exists in the KB or not: x = [CLS] h r t [SEP] p_1 [SEP] ... [SEP] p_n [SEP]. By training on this task, the model learns to reason over and complete the missing connections, which is extremely helpful for KBQA over an incomplete KB.
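The following sketch shows one way such relation reasoning examples could be built from a KB; find_paths is a hypothetical helper (e.g., a BFS over the remaining triples returning linearized path strings), and negative construction is only hinted at in a comment.

```python
import random

def make_rr_example(kb_triples, find_paths, max_paths=3):
    """Hold out one triple (h, r, t) and predict it from alternative paths."""
    h, r, t = random.choice(kb_triples)
    remaining = [tr for tr in kb_triples if tr != (h, r, t)]  # pretend it is missing
    paths = find_paths(remaining, h, t)[:max_paths]
    text = f"{h} {r} {t} [SEP] " + " [SEP] ".join(paths)
    # Negatives can be built analogously by corrupting t with a random entity.
    return text, 1                                  # the held-out triple does exist
```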
Training Since all three auxiliary tasks are formulated as binary classification and differ only in the data construction phase, we can either use them to pre-train BERT before KBQA (noted as pre-train) or train them jointly with KBQA in a multi-task paradigm (noted as joint). In our experiments, we find both settings work well and produce similar results (see Section 3.4 for more details).

Experiments

Datasets

WebQSP We obtain and preprocess WebQSP using the scripts released by Sun et al. (2018). Preprocessing mainly includes two steps: entity linking and subgraph retrieval. The entity linking results are taken directly from the codebase released by Yih et al. (2015b); for each question, this yields a set of seed entities used in the subgraph retrieval phase. The subgraphs are retrieved with the Personalized PageRank (PPR) algorithm (Haveliwala, 2002), and we set the maximum number of entities in each subgraph to 500. Among the 1639 examples in the test set, the answers of 120 questions are not covered by the retrieved subgraphs, so the answer coverage of the subgraph retrieval phase is about 92.68%.
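As a rough illustration of the retrieval step, the sketch below runs Personalized PageRank with networkx over a KB neighborhood; the damping factor and uniform seed weighting are assumptions, while the 500-entity cap matches the setting above.

```python
import networkx as nx

def retrieve_subgraph(graph: nx.DiGraph, seed_entities, max_entities=500):
    """Keep the top-scoring entities around the seeds by Personalized PageRank."""
    personalization = {e: 1.0 / len(seed_entities) for e in seed_entities}
    scores = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    keep = sorted(scores, key=scores.get, reverse=True)[:max_entities]
    return graph.subgraph(keep).copy()
```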
Relation Extraction Datasets In the relation learning tasks, we use the WebRED (Ormandi et al., 2021) and FewRel (Han et al., 2018) datasets as external resources. For more details about these datasets and how we process them to construct the relation learning tasks, please refer to Appendix A.

Baselines
We compare our approach to several baselines, including KV-Mem (Miller et al., 2016), GraftNet (Sun et al., 2018), PullNet (Sun et al., 2019), EmbedKGQA (Saxena et al., 2020) and NSM (He et al., 2021). Please refer to Appendix B for more details. Besides, we also provide results of plain BERT (without additional relation learning) as a baseline to show the effectiveness of our proposed relation learning tasks.

Metrics
When evaluating our model, we first feed each linearized input to BERT and obtain a score between 0 and 1. For each question, we rank all candidate entities in the subgraph by their scores and calculate Hits@1 and F1 as follows:

• Hits@1 If the highest-ranked entity is an answer entity, Hits@1 is 1; otherwise, Hits@1 is 0.
• F1 score Given a threshold, we consider all candidate entities whose scores are greater than the threshold as the answers predicted by the model, and calculate the F1 score between the ground-truth answer entities and the predicted ones. In our experiments, we select the threshold that performs best on the validation set.
We then average the Hits@1 and F1 scores over all test examples. Questions whose answers are not covered by the retrieved subgraph are counted as wrong predictions. Note that we treat Hits@1 as the primary metric, since the F1 score shows a large variance due to its sensitivity to the threshold. We provide more training details in Appendix C.
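Both metrics can be computed per question as in the following sketch; the dictionary-based interface is illustrative.

```python
def hits_at_1(scores: dict, gold: set) -> float:
    """1.0 if the top-ranked candidate is a gold answer, else 0.0."""
    top = max(scores, key=scores.get)
    return 1.0 if top in gold else 0.0

def f1(scores: dict, gold: set, threshold: float) -> float:
    """F1 between thresholded predictions and gold answers."""
    pred = {e for e, s in scores.items() if s > threshold}
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```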

Main Results
The experimental results are shown in Table 1. We find that BERT outperforms most of the baselines (except NSM). Compared to PullNet, BERT achieves a relative improvement of 4.6%, demonstrating the effectiveness of solving KBQA with BERT.
On the other hand, training with all three relation learning tasks (72.3) significantly outperforms the BERT baseline (71.2), showing that the proposed auxiliary tasks benefit the relation matching and relation reasoning abilities of BERT.

Ablation Studies To check which task contributes most to the final result, we conduct experiments where only one task is applied at a time. From the second part of Table 1, we observe that RE and RM are the two most contributing tasks; even training with either of them individually can outperform training with all three tasks together. Meanwhile, RR also brings a performance improvement (from 71.2 to 71.7) under the pre-train setting, but its improvement is not as significant as that of RE and RM. This may be because the model does not require much reasoning ability under the full KB setting.
Pre-training or Joint Training When comparing the pre-training setting with joint training, we find both settings work well and outperform the BERT baseline. For RE and RR, pre-training seems better than joint training, while for RM, joint training is slightly better.

Analysis
Results over the Incomplete KB To verify the robustness of our approach when the KB is incomplete, we randomly remove 50% of the KB facts in the retrieved subgraphs and conduct experiments on this incomplete version of the WebQSP dataset.
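A minimal sketch of this fact-removal procedure, assuming each subgraph's facts are stored as a flat list of triples; keep_ratio=0.5 reproduces the 50% KB setting.

```python
import random

def drop_facts(triples, keep_ratio=0.5, seed=0):
    """Return a randomly downsampled copy of a subgraph's fact triples."""
    rng = random.Random(seed)
    return [t for t in triples if rng.random() < keep_ratio]
```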
The results are illustrated in Figure 3; Appendix D provides more results under the incomplete KB (with different proportions) as well as the comparison to baselines.

Figure 3: The performance comparison with the full KB and the 50% KB. We only compare to baselines that also report results with 50% KB.

We can observe that: 1) Our approach consistently outperforms other baselines under both the full KB and the 50% KB settings. 2) With 50% KB, adding the relation learning tasks achieves a larger performance gain than with the full KB (+2.1 vs. +1.1), demonstrating that our relation learning tasks are especially useful when the KB is incomplete.

Annotation of Entity Spans As discussed in Soares et al. (2019), the choice of markers for entity spans has a great impact on the BERT-based relation extraction task. To find the best annotation strategy for KBQA, we conduct experiments with three types of annotations: 1) using no annotation (noted as none); 2) using <E> and </E> to annotate all entities in the reasoning paths (noted as all); 3) using <E> and </E> to annotate only the head (topic) and tail (candidate) entities of each path (noted as head-tail).

Table 2: Results with the three annotation types (none, all, head-tail).
As shown in Table 2, none performs worst while head-tail achieves the best result. We conclude that annotations of the entity spans are necessary for the BERT model: they bring structural information that helps the model identify the entities. Meanwhile, fine-grained annotations (head-tail) are better than coarse-grained ones (all).
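The sketch below shows how the three annotation types could be applied when linearizing a path, under our reading that head-tail marks only the first (topic) and last (candidate) entities; annotate_path is a hypothetical helper, not the paper's code.

```python
def annotate_path(entities, relations, style="head-tail"):
    """Linearize a path e_0 -r_1-> e_1 ... -r_n-> e_n with entity markers."""
    marked = list(entities)
    if style == "all":
        marked = [f"<E> {e} </E>" for e in entities]
    elif style == "head-tail":
        marked[0] = f"<E> {entities[0]} </E>"       # topic entity
        marked[-1] = f"<E> {entities[-1]} </E>"     # candidate entity
    parts = [marked[0]]
    for rel, ent in zip(relations, marked[1:]):
        parts += [rel, ent]
    return " ".join(parts)

# e.g. annotate_path(["Barack Obama", "Michelle Obama", "Chicago"],
#                    ["spouse", "place_of_birth"])
```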
Influence of Negative Samples In our experiments, we speed up training by downsampling the negative samples of KBQA. However, as shown in Table 3, the performance is related to the number of negative samples: in general, more negative samples bring a higher Hits@1 score. One potential solution to this trade-off is hard negative mining, which we leave for future work.
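One plausible implementation of this downsampling step, with an illustrative negatives-per-positive ratio (Table 3 varies this number):

```python
import random

def downsample(examples, neg_per_pos=5, seed=0):
    """Keep all positives and at most `neg_per_pos` negatives per positive."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex["label"] == 1]
    neg = [ex for ex in examples if ex["label"] == 0]
    rng.shuffle(neg)
    return pos + neg[: neg_per_pos * max(len(pos), 1)]
```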

Conclusion
In this paper, we propose three auxiliary tasks to augment relation learning for the BERT-based KBQA method: relation extraction, relation matching, and relation reasoning. These tasks not only bridge the gap between natural language and the structured KB, but also explicitly train the model to reason over an incomplete KB. The experimental results on WebQSP demonstrate the effectiveness of our approach, especially when the KB is incomplete.

B Baselines
We compare our approach to the following baselines: KV-Mem (Miller et al., 2016) adopts a key-value memory network to store the KB facts and uses it to augment open-domain question answering.
GraftNet (Sun et al., 2018) proposes to solve the open-domain question answering task by retrieving from both the KB and a textual corpus, and designs a variant of graph convolutional network for the heterogeneous graph.
PullNet (Sun et al., 2019) uses GraftNet as the model architecture, but additionally learns how to retrieve information and expand the subgraph during the training and test phases.
EmbedKGQA (Saxena et al., 2020) uses the pre-trained KB embeddings and trains the question encoder to make question embeddings aligned with the relation embedding space such that they can directly use the scoring function to predict whether a given entity is the answer or not.
NSM (He et al., 2021) proposes to use a neural state machine (NSM) to solve the KBQA task, and adopts bidirectional hybrid reasoning and a two-stage teacher-student architecture to augment the reasoning ability of the student model.

C Training Details
We run all our experiments on a single NVIDIA Tesla V100 (32GB) GPU. We set the batch size to 128 and the max sequence length to 128 for BERT. We evaluate the model every 1000 or 3000 training steps depending on the number of total training steps in one epoch; each evaluation takes about 6 minutes. We train the model for up to 3 epochs with a learning rate of 2e-5. For the pre-trained BERT, we download the bert-base-uncased model from HuggingFace and set the dropout rate to 0.2 during training. The best results are typically achieved after training BERT for 2-3 epochs (roughly 15,000-25,000 steps), which takes about 6-8 hours (roughly 1.8 steps per second). The number of model parameters is 109,483,009 (109M), including the parameters of BERT and the linear head for binary classification. We manually tune all hyperparameters on 250 examples reserved from the WebQSP training set, using the F1 score as the selection metric.
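For reference, the training configuration described above can be summarized as the following illustrative dictionary (not the authors' actual config file):

```python
config = {
    "model": "bert-base-uncased",
    "batch_size": 128,
    "max_seq_length": 128,
    "learning_rate": 2e-5,
    "dropout": 0.2,
    "epochs": 3,               # best results typically after 2-3 epochs
    "eval_every_steps": 1000,  # or 3000, depending on epoch length
}
```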

D More Results over the Incomplete KB
We show more experimental results on the incomplete KB in Table 4. We can make the following observations: 1) When the KB is extremely incomplete (10% KB and 30% KB), our approach achieves significant performance gains compared to the previous work GraftNet (Sun et al., 2018) (+7.2 under the 10% KB setting and +11.5 under the 30% KB setting). 2) The relation reasoning (RR) task performs well when the KB is extremely incomplete (10% KB and 30% KB), but its performance gain decreases when the KB is relatively complete (50% KB and Full KB). 3) The relation matching (RM) task is the most robust, showing strong performance gains across different levels of KB incompleteness.