A Multi-label Multi-hop Relation Detection Model based on Relation-aware Sequence Generation

Multi-hop relation detection in Knowledge Base Question Answering (KBQA) aims at retrieving the relation path from the topic entity to the answer node for a given question, where the path may comprise multiple relations. Most existing methods treat it as a single-label learning problem, ignoring the fact that for some complex questions there exist multiple correct relation paths in the knowledge base. In this paper, multi-hop relation detection is therefore considered as a multi-label learning problem. However, performing multi-label multi-hop relation detection is challenging since the numbers of both the labels and the hops are unknown. To tackle this challenge, multi-label multi-hop relation detection is formulated as a sequence generation task, and a relation-aware sequence generation model is proposed to solve it in an end-to-end manner. Experimental results show the effectiveness of the proposed method for both relation detection and KBQA.


Introduction
With the development of Knowledge Bases (KBs) such as DBpedia, Freebase, and WikiData, Knowledge Base Question Answering (KBQA) systems (Berant et al., 2013; Bordes et al., 2015; Yin et al., 2016; Hao et al., 2018) are attracting more and more attention. A KBQA system typically contains two core components: (1) entity linking, which identifies the topic entity mentioned in the question; (2) relation detection, which detects the relation paths from the topic entity to the answer node.
Relation detection in KBQA can be categorized into single-relation (one-hop) detection and multi-relation (multi-hop) detection. Most existing single-relation detection methods (Yin et al., 2016; Yu et al., 2017; Lukovnikov et al., 2017; 2018) rely on measuring the semantic similarity between questions and candidate relations. He and Golub (2016) proposed an encoder-decoder based generative framework for single-relation extraction. For multi-relation detection, some approaches (Yih et al., 2015; Yu et al., 2017, 2018) tackle two- or three-relation detection by applying constraints that fix the number of hops. Xiong et al. (2017) and Das et al. (2017) modeled the relation reasoning problem as a Markov decision process. Chen et al. (2019) exploited a transition-based search framework to select relations dynamically. Recently, some researchers have attempted to model prediction uncertainty in the simple question answering task with Bayesian neural networks (Zhang et al., 2021). Generally, most existing methods focus on detecting one optimal relation path, treating the task as a single-label learning problem.
However, for some questions, there may exist multiple relation paths to the correct answer. For example, as shown in the upper part of Figure 1, there are two distinct relation paths, time/time/event and olympic/olympic_games/host_city, with the same meaning, making the instance multi-label. Moreover, as shown in the lower part of Figure 1, there are two relation paths starting from the different topic entities Pierce Brosnan and 007. A robust KBQA system should be able to infer the final answer based on multiple relation paths. We therefore consider multi-label multi-hop relation detection in this paper.
Nevertheless, it is challenging to perform multi-label multi-hop relation detection since both the number of relation paths and the number of hops in each relation path are unknown. To deal with this challenge, we formulate it as a sequence generation task of the following form: $\{r^1_1, \ldots, r^1_{n_1}, [SEP], r^2_1, \ldots, r^2_{n_2}, [SEP], \ldots, r^m_1, \ldots, r^m_{n_m}, [END]\}$, where $r^j_i$ denotes the $i$-th relation in the $j$-th path, the commas separate relations within a path, and [SEP] marks the boundary between different relation paths. A relation-aware sequence generation model (RSGM) is proposed to learn this sequence generation task end-to-end, without needing to know the number of labels and hops beforehand. Specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is employed as the encoder of RSGM, while a Gated Recurrent Unit (GRU) with relation-aware attention is designed as the decoder to incorporate the semantic information of the relations. Moreover, a constraint-learning strategy is proposed to mitigate the exposure bias and label repetition problems in sequence generation.
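The target serialization above can be sketched as follows. This is an illustrative sketch, not the authors' code, and the relation names used in the example are hypothetical:

```python
# Flatten a set of relation paths into one target token sequence:
# relations of each path in order, [SEP] between paths, [END] at the end.
SEP, END = "[SEP]", "[END]"

def paths_to_sequence(paths):
    """Join relation paths with [SEP] and terminate with [END]."""
    tokens = []
    for j, path in enumerate(paths):
        if j > 0:
            tokens.append(SEP)
        tokens.extend(path)
    tokens.append(END)
    return tokens

paths = [
    ["time/time/event"],                                   # 1-hop path
    ["olympic/olympic_games", "olympic_games/host_city"],  # hypothetical 2-hop path
]
print(paths_to_sequence(paths))
```

The model then only has to predict one flat token sequence, so neither the number of paths nor the number of hops needs to be fixed in advance.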
The main contributions of this paper are:
• An end-to-end relation-aware sequence generation model, RSGM, is proposed to deal with the multi-label multi-hop relation detection problem.
• Experimental results show the effectiveness of the proposed model on both relation detection and the KBQA end-task.

Problem Setting
Let $G = (S, R, O)$ be the KB, where $S$ represents the set of subject entities, $O$ the set of object entities, and $R$ the set of relations between subject and object entities. Given a question $q$ and the knowledge base $G$, traditional methods treat multi-hop relation detection as a single-label learning problem, aiming to find the single optimal relation path $p$, where

$p = (r_1, r_2, \ldots, r_n), \quad r_i \in R. \qquad (1)$

In this paper, multi-label multi-hop relation detection is considered. The objective is therefore to find a set of relation paths $P = \{p^1, p^2, \ldots, p^m\}$ based on question $q$ and knowledge base $G$. As both the number of labels and the number of hops are unknown, the task is formulated as a sequence generation problem: generate a token sequence $Y = (y_1, y_2, \ldots, y_M)$ given question $q$ and knowledge base $G$, where each $y_i$ is a relation in $R$ or a special token, [SEP] indicates the division between different relation paths, and [END] indicates the end of the sequence.

The Proposed Model
The overall architecture of the proposed relation-aware sequence generation model (RSGM) is presented in Figure 2. It consists of two components, discussed in more detail below: (1) the question encoder, where the question is transformed into a context-aware representation using BERT; (2) the relation decoder, where a GRU with relation-aware attention generates the relations sequentially.
Question Encoder To utilize the rich semantic information of a large pre-trained model, BERT is employed as the encoder; it takes the question as input and learns a context-aware representation for each token. Specifically, for a given question $q_i$, the token [CLS] is inserted as the first token to obtain the representation $h_q$ of the whole question. The representation of word $w_i$ is computed as the sum of three embeddings: the token embedding $h^{tok}_i$, the segment embedding $h^{seg}_i$, and the position embedding $h^{pos}_i$:

$h_{w_i} = h^{tok}_i + h^{seg}_i + h^{pos}_i.$

As a result, a list of token embeddings $W_0 = \{h_q, h_{w_1}, h_{w_2}, \ldots, h_{w_N}\}$ is obtained and fed into a stack of $L$ pre-trained Transformer blocks.

Relation Decoder To generate relations sequentially, a GRU is employed. The prediction $y_i$ at time $i$ is affected by three factors: the hidden state at time $i-1$, the token predicted at time $i-1$, and the relation-aware question representation; formally,

$s_i = \mathrm{GRU}(s_{i-1}, [h_{y_{i-1}}; c_i]),$

where $s_{i-1}$ denotes the hidden state at time $i-1$, $h_{y_{i-1}}$ denotes the embedding of the token $y_{i-1}$ predicted at time $i-1$, and $c_i$ denotes the relation-aware question representation. Assuming relation $r$ consists of $N_r$ words, its embedding is defined as the sum of its word embeddings to encode its semantic information:

$h_r = \sum_{k=1}^{N_r} h_{w_k}.$

The embedding of the token [SEP] is initialized randomly and updated during training.
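The summed relation embedding can be illustrated with a minimal sketch. The vocabulary and 4-dimensional vectors below are toy assumptions, not the paper's actual embeddings:

```python
# Embedding of a multi-word relation token as the sum of the embeddings
# of its constituent words.
import numpy as np

word_emb = {
    "film": np.array([1.0, 0.0, 0.0, 0.0]),
    "character": np.array([0.0, 1.0, 0.0, 0.0]),
    "portrayed": np.array([0.0, 0.0, 1.0, 0.0]),
}

def relation_embedding(relation_words):
    """h_r = sum of the word embeddings of the relation's words."""
    return sum(word_emb[w] for w in relation_words)

h_r = relation_embedding(["film", "character"])
print(h_r)  # [1. 1. 0. 0.]
```

Summing word embeddings lets unseen relation tokens reuse the semantics of their constituent words instead of being treated as atomic labels.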
The relation-aware question representation $c_i$ is calculated by taking the word embeddings $h_{w_j}$ as the keys and values and the hidden state $s_{i-1}$ as the query:

$c_i = \sum_{j=1}^{N} \alpha_{ij} h_{w_j}, \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \quad e_{ij} = f(s_{i-1}, h_{w_j}),$

where $f$ is the attention score function.
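The attention step can be sketched as follows, assuming a dot-product score function (the paper does not specify which score is used): the decoder state is the query and the token embeddings serve as keys and values.

```python
# Relation-aware attention sketch: context c_i is the softmax-weighted
# sum of token embeddings, scored against the previous decoder state.
import numpy as np

def relation_aware_context(s_prev, H):
    """s_prev: (d,) decoder hidden state; H: (N, d) token embeddings.
    Returns c_i = sum_j alpha_j * H[j]."""
    scores = H @ s_prev                  # assumed dot-product score e_ij
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over question tokens
    return alpha @ H                     # weighted sum of the values

H = np.array([[1.0, 0.0], [0.0, 1.0]])
c = relation_aware_context(np.array([10.0, 0.0]), H)
print(c)  # close to the first token's embedding
```

Because the query changes at every decoding step, each relation prediction attends to a different part of the question, which is what the visualization in Figure 3 illustrates.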

Training and Inference
The proposed model is trained with the regular sequence-to-sequence loss by maximizing the likelihood of the ground-truth token sequence. At the training stage, to bridge the gap between training and inference, the scheduled sampling policy (Bengio et al., 2015) is employed, which uses part of the ground-truth sequence to guide model learning. At the testing stage, the beam search optimization approach (Wiseman and Rush, 2016) is used to mitigate the exposure bias problem. Additionally, a constraint mechanism is added to avoid the repetition problem: unlike in text generation or machine translation, the multiple relation paths for the same question are distinct from one another, so once a relation path has been generated, an infinite penalty is assigned to it to prevent it from being generated again.
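The constraint mechanism can be sketched as below. The scoring setup is a hypothetical simplification of what happens inside beam search, not the paper's implementation:

```python
# Once a complete relation path has been emitted, its score is set to
# -inf so the decoder cannot generate the same path again.
import math

def apply_path_penalty(path_scores, generated_paths):
    """path_scores: dict mapping candidate path tuples to log-scores.
    Paths already generated receive an infinite penalty."""
    return {
        path: (-math.inf if path in generated_paths else score)
        for path, score in path_scores.items()
    }

scores = {("r1",): -0.1, ("r2", "r3"): -0.5}
penalized = apply_path_penalty(scores, generated_paths={("r1",)})
best = max(penalized, key=penalized.get)
print(best)  # ('r2', 'r3'): the already-emitted path is excluded
```

This hard constraint exploits the task structure directly: since duplicate paths are never correct, masking them is safe and cheaper than learning to avoid repetition.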

Experiments
We conduct experiments on a large KBQA benchmark dataset FreebaseQA (Jiang et al., 2019) to evaluate the effectiveness of the proposed RSGM model.

Dataset
FreebaseQA (Jiang et al., 2019) is a KBQA dataset generated by matching trivia-type question-answer pairs with facts in Freebase. In particular, for each question in the dataset there often exist multiple multi-hop relation paths that lead to the correct answer; here multi-hop means that multiple hops in the knowledge base are needed to reach the answer node. Compared with the existing well-known KBQA datasets SimpleQuestions (Bordes et al., 2015) and WebQuestions (Berant et al., 2013), it has the following characteristics: (1) for a given question, it provides multiple annotated relation paths to the correct answer as ground truth; (2) the linguistic structure of the questions is more sophisticated; (3) more training instances are provided, enabling effective training of neural networks. The detailed statistics are shown in Table 2. The FreebaseQA dataset is publicly available 1 . Since FreebaseQA is the only KBQA dataset annotated with multiple relation paths, we mainly conduct our experiments on this dataset.

Baselines
To the best of our knowledge, no other KBQA method considers multi-label multi-hop relation detection, so to demonstrate the effectiveness of the proposed model, we include the following baselines and modify them to perform a multi-label prediction task:
• CNN-multichannel (Kim, 2014): multiple filters are employed to extract sentence features, and a fully connected layer with the sigmoid function is used to obtain the probability of each label.
• ML-KNN (Zhang and Zhou, 2014): the maximum a posteriori (MAP) rule is exploited to make predictions by reasoning with the labeling information implied in the k-nearest neighbors, without exploiting label correlations.
1 http://github.com/infinitecold/FreebaseQA
• HAN (Yang et al., 2016): a hierarchical attention network is employed to obtain sentence representations and then generate document representations based on them.
• SGM: a novel sequence-to-sequence structure with global embedding is proposed to capture the correlations between labels.
• SGM-BERT: a variant of SGM by replacing the encoder of SGM with BERT.

Evaluation Metrics
To evaluate the performance of different approaches, several evaluation metrics are employed including Precision, Recall, Micro-F1 score and Hamming Loss as suggested in (Zhang and Zhou, 2007).
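As a concrete illustration of these metrics, suppose each instance's relation paths are encoded as a binary indicator vector (an assumed representation, one entry per candidate path):

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1, and Hamming loss over
    lists of equal-length binary label vectors."""
    tp = fp = fn = wrong = total = 0
    for t, p in zip(y_true, y_pred):
        for ti, pi in zip(t, p):
            tp += ti * pi              # predicted 1, truly 1
            fp += (1 - ti) * pi        # predicted 1, truly 0
            fn += ti * (1 - pi)        # predicted 0, truly 1
            wrong += int(ti != pi)     # any disagreement
            total += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1, wrong / total

# One instance: gold paths {0, 2}, predicted paths {0, 1}.
print(multilabel_metrics([[1, 0, 1]], [[1, 1, 0]]))
```

Micro-averaging pools counts over all labels, so frequent relation paths weigh more, while Hamming loss measures the fraction of individual label decisions that are wrong (lower is better).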

Model Setup
The uncased BERT-base model is employed as the text encoder, with its parameters fine-tuned during training. For the GRU decoder, the hidden state dimension is set to 128 and the beam size to 5. The whole model is trained with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 and a dropout rate of 0.3. The number of epochs is 10 and the mini-batch size is set to 20. The hyperparameters are chosen based on evaluation results on the dev subset.

Relation Detection Results
Experimental results on the FreebaseQA benchmark are listed in Table 1. It can be observed that: (1) SGM-BERT outperforms SGM, demonstrating the effectiveness of the BERT encoder; (2) the proposed RSGM outperforms SGM by a large margin on all metrics. The reasons can be summarized as follows: (1) RSGM employs a relation-aware attention mechanism, which provides a more informative sentence representation than vanilla self-attention; (2) RSGM formulates multi-label multi-hop relation detection as a sequence generation task, which yields a much smaller search space than methods that perform a multi-label prediction task.

KBQA End-Task Results
To investigate the effectiveness of the proposed relation detection method for the KBQA end-task, we perform entity linking and retrieve the final answer with the relation paths detected by RSGM. The experimental results of the KBQA end-task are shown in Table 4. FOFE-net (Jiang et al., 2019) is a pipeline KBQA system built on FOFE-net (Xu et al., 2017), which achieves outstanding results on the SimpleQuestions and WebQSP datasets. The RSGM result is obtained by performing entity linking and relation detection with the proposed model. Based on the multiple relation paths generated by RSGM, a majority vote strategy is employed to obtain the final answer. The results show that RSGM outperforms FOFE-net on the KBQA end-task on the FreebaseQA dataset.
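The majority-vote step can be sketched as follows. How each path is executed against the KB is an assumed detail; here each generated path is simply taken to yield one candidate answer:

```python
# Each generated relation path retrieves a candidate answer from the KB;
# the answer supported by the most paths is returned.
from collections import Counter

def majority_vote(candidate_answers):
    """Return the answer retrieved by the most relation paths."""
    return Counter(candidate_answers).most_common(1)[0][0]

answers = ["London", "London", "Paris"]  # hypothetical per-path results
print(majority_vote(answers))  # London
```

Voting across paths is what makes the multi-label formulation pay off at the end-task: a single wrong path can be outvoted by the agreeing ones.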

Visualization of relation-aware attention
Referring back to the bottom example in Figure 1, it can be observed that different parts of the question play different roles when predicting different relations. Moreover, the attention mechanism between relations and questions can be used to select the most meaningful words in a given question. To demonstrate this, the weights in the attention layer are extracted and visualized with colors reflecting the contributions of different words. The results are shown in Figure 3. It can be observed that the attention is captured properly and indicates which parts of the question contribute more. For instance, "pierce [UNK]" receives more attention when the relation "film/film_character/portrayed_in_films" is detected.

Conclusion
In this paper, we frame multi-hop relation detection as a multi-label learning problem. To address the challenge of multi-label multi-hop relation detection, we cast it as a sequence generation problem, and a relation-aware sequence generation model is proposed to solve it in an end-to-end manner. Experimental results show that our approach not only achieves better relation detection performance but also improves the results of a state-of-the-art KBQA system.