Complex Question Answering on knowledge graphs using machine translation and multi-task learning

Question answering (QA) over a knowledge graph (KG) is the task of answering a natural language (NL) query using the information stored in the KG. In a real-world industrial setting, this involves addressing multiple challenges, including entity linking and multi-hop reasoning over the KG. Traditional approaches handle these challenges in a modularized, sequential manner, where errors in one module accumulate in downstream modules. These challenges are often inter-related, and their solutions can reinforce each other when handled simultaneously in an end-to-end learning setup. To this end, we propose a multi-task, BERT-based Neural Machine Translation (NMT) model to address these challenges. Through experimental analysis, we demonstrate the efficacy of our proposed approach on one publicly available and one proprietary dataset.


Introduction
Question answering on knowledge graphs (KGQA) has mainly been attempted on publicly available KGs such as Freebase Bollacker et al. (2008), DBpedia Lehmann et al. (2015), Yago Suchanek et al. (2007), etc. There is also a demand for question answering on proprietary KGs created by large enterprises. For example, KGQA on a) a KG that contains information related to retail products can help customers choose the right product for their needs, b) a KG containing document catalogs (best practices, white papers, research papers) can help a knowledge worker find a specific piece of information, or c) a KG that stores profiles of various companies can be used to perform preliminary analysis before granting them a loan. Our motivating use-case comes from an enterprise system (referred to as LOCA) that is expected to answer users' questions about the R&D division of an enterprise.

Figure 1: Example queries from the real-world dataset LOCA. Column 6 (is SP?) indicates whether the query can be answered via a shortest path; all other columns are self-explanatory.
Sample questions from the LOCA dataset are shown in Figure 1. The schema of the corresponding KG is shown in Figure 2. Answering such questions often requires traversing the KG along multiple relations, which may not form a directed chain graph and may follow a more complex topology, as shown for questions 5, 7, and 8 in Figure 1. It can also be observed that, most often, the words of the natural language question (NLQ) and the corresponding relations are only weakly correlated. Most of the proposed approaches to the KGQA task parse the NLQ, convert it into a structured query, and then execute the structured query on the KG to retrieve the factoid answers. Such conversion involves multiple sub-tasks: a) linking the mentioned entity with the corresponding entity node in the KG Blanco et al. (2015); Pappu et al. (2017), b) identification of the type of the answer entity Ziegler et al. (2017), and c) identification of relations Dubey et al. (2018); Hakkani-Tür et al. (2014). These tasks are most often performed in sequence Both et al. (2016); Dubey et al. (2016); Singh et al. (2018), or in parallel Veyseh (2016); Xu et al. (2014); Park et al. (2015), which results in accumulation of errors Dubey et al. (2018). Further, most KGQA datasets are not as complex as LOCA. For example, a) all questions of SimpleQA Bordes et al. (2015) can be answered using a single triple, b) NLQs in most datasets (e.g., SimpleQA, MetaQA) contain only one mentioned entity, and c) even when multiple relations are required for answer entity retrieval, they are organized in a sequence, i.e., a chain.
Our motivating example contains specific types of questions that pose many challenges with respect to each of the aforementioned tasks. Moreover, some of the questions can only be answered by a model that attempts more than one sub-task together. For example, the first two questions of Figure 1 mention the same words, i.e., "deep learning", but they get associated with two different entity nodes of the KG. Additionally, prior work can detect the set of relations when the schema sub-graph follows a specific topology; however, in our example, most of the questions follow a different topology. We demonstrate in Section 5 that most prior approaches fail to address such challenges. We provide a summary of these challenges in Section 2.
In this paper, we propose CQA-NMT, a novel transformer-based neural machine translation (NMT) model that addresses the aforementioned challenges by performing four tasks jointly using a single model: i) detection of mentioned entities, ii) prediction of the entity types of answer nodes, iii) prediction of the topology and the relations involved, and iv) question type classification such as 'Factoid', 'Count', etc. CQA-NMT not only performs the four sub-tasks but also helps the downstream tasks of mentioned entity disambiguation and subsequent answer retrieval from the KG. The key contributions of this paper are: (i) We propose a multi-task model that performs all the tasks required for parsing a natural language question together, rather than the traditional approach of performing these tasks sequentially, which also involves generating candidates based on an upstream task and then short-listing them to make the final prediction. We also demonstrate that such an approach can solve newer types of challenges of the KGQA task that have not been attempted by prior work so far.
(ii) We propose a neural machine translation based approach to retrieve the variable number of relations involved in answering a complex NLQ against a KG.
(iii) We also demonstrate that every sub-task of parsing an NLQ is complementary to the other tasks and helps the model perform better towards the final goal of KGQA. In Table 3, we demonstrate that joint training on more than one task improves the accuracy of the individual tasks compared to training them separately. For example, when trained separately, the best F1-score for detecting mentioned entity(s) was 83.3, and the best accuracy for predicting the entity types of answer nodes was 75.7. When trained jointly, the corresponding metrics are 87.1 and 76.3. When trained jointly on all tasks, the results improve even further.
(iv) CQA-NMT predicts the relations involved in a sub-graph of the KG and also helps predict the topology of the sub-graph, resulting in compositional reasoning via a neural network on the KG. In contrast, prior work predicts the relations for a specific topology only.¹
(v) We also demonstrate that our approach outperforms the state-of-the-art approaches on the MetaQA dataset, and we therefore present a new baseline on this dataset. Our approach also performs better than standard approaches as applicable to our dataset and helps us solve most of the real-world industrial challenges.

¹ A topology is a specific arrangement of how the mentioned entities and answer entities are connected to each other via the predicted relations. Sample topologies are given in Figure 1. Our approach can be used to answer questions of any topology, provided an adequate number of samples are included in the training data. Prior works have not attempted a dataset such as LOCA, which contains many different topologies. Despite our best efforts, we could not find another such dataset, which has led to our aforementioned belief.

KGQA Problem and Challenges
For answering natural language questions (NLQ), we assume that the background knowledge is stored in a knowledge graph G, comprising a set of nodes V(G) and edges E(G). Here, nodes represent entities, and edges represent the relationship between a pair of entities or connect an entity to one of its properties. An NLQ (q) is a sequence of words w_i of a natural language (e.g., English), i.e., q = {w_1, w_2, ..., w_N}. We also assume that an NLQ can mention zero, one, or more entities present in G and enquire about another entity of G, which is connected with the mentioned entity(s). We pose the KGQA problem as a supervised learning problem and next describe the labels assumed to be available for every question in the training data, which need to be predicted for every question in the test data.
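For illustration, a tiny KG fragment of this kind can be written down as a labeled directed graph; this is a hypothetical sketch whose entity ids and relation names are ours and only loosely follow the schema of Figure 2:

```python
import networkx as nx

# Toy knowledge graph G: nodes are entities, edge labels are relations.
G = nx.MultiDiGraph()
G.add_edge("e5", "r1", label="key person")   # sub-area -> researcher
G.add_edge("e6", "d1", label="has paper")    # keyword -> paper
G.add_edge("d1", "r1", label="author")       # paper -> researcher

nodes = G.nodes()                 # V(G), the entity set
edges = G.edges(data="label")     # E(G), relations with their labels
```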
Entity Linking Annotation: Some of the n-grams (η_i) in an NLQ refer to entity(s) of the KG. Such n-grams have been underlined in Figure 1. The entity-id (as shown in the third column of Figure 1) of the mentioned entity is also assumed to be available as part of the label annotation for every question.
Answer Entity Type Annotation (AET), τ: We assume that every NLQ has an entity type (t_i) for the answer entities. These are shown in the middle column of Figure 1. We refer to τ as the set of all entity types in the knowledge graph G.
Relation Sequence and Topology Annotation (path): Sequences of relations connecting the linked entities to the answer entities can be considered paths (path_i), each of which can contain one or more relations. These paths are connected to form a topology, as shown in Figure 1. The topology of the paths and relations is also assumed to be available for an NLQ in the training data. These paths need not be the shortest paths between the linked entities and the answer entities. For example, the last three columns of Figure 1 indicate a) the set of paths separated by a semicolon (;), b) whether this is the shortest path, and c) the topology of the paths connecting the linked entities to the answer entities.
Question Type Annotation (q_type): Some NLQs can be answered by a single triple of the knowledge graph ('Simple'), while some require traversal along a more complex topology as indicated earlier ('Factoid'); some questions require an aggregate operation such as count ('Count', see question 6 in Figure 1); and finally, some questions perform an existence check ('Boolean', see question 8 in Figure 1). Such information is also assumed to be available for every NLQ in the training data.
We now describe the challenges that need to be addressed while performing the KGQA task. Despite our best efforts, we could not find any prior work that covers all of these challenges together.
1. Incomplete Entity Mention: In the NLQ, users often do not mention the complete name of the intended entity Huang et al. (2019), e.g., only the first name of a person or the short name of a group; see question 8 in Figure 1.
2. Co-occurrence Disambiguation: Sometimes a mentioned entity can be linked to a KG entity only with the help of another entity mentioned in the question. For example, in question 7 of Figure 1, there can be many people who share the same first name ('Libby'), but only one of them works on NLP; the model needs to use this information to conclusively resolve the mentioned entities Mohammed et al. (2017).
3. Avoiding Un-intended Matches: Some of the words in a sentence coincidentally match an entity name but are not an intended mention of an entity, e.g., the word 'vision' may get matched with 'Computer Vision', which is not intended in question 9 of Figure 1.

4. Duplicate KG Entity: The intended entity names may be different from the words used in the NLQ, and there can be more than one entity in the KG that has the same name Shen et al. (2019). For example, "Life Sciences" is the name of a research area as well as a keyword (see the KG schema given in Figure 2). The model needs to link the entity using other words, as shown in questions 1 and 2 of Figure 1.
6. Implicit Relations Indication: Sometimes the words of the NLQ do not make any mention of the relations involved; these relations nevertheless need to be inferred Zhang et al. (2018). For example, in question 4 of Figure 1, some of the relations are not mentioned in the question.
Problem Definition: The objective of the proposed approach is to output 1) the mentioned entity(s) (s_i) in the query, 2) the answer entity type, 3) the path, i.e., the set of predicates P_q = {p_q^1, p_q^2, p_q^3, ..., p_q^N}, where each p_q^i ∈ E(G), and 4) the question type. The set P_q is a sequence of predicates such that if we traverse along these edges from the mentioned entity node(s), we arrive at the answer entity node(s). The final answer is then retrieved from the KG and post-processed as per the outputs of the question type and answer entity type modules. We assume that we have N training samples.
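To make the supervision concrete, the following shows the labels for the query discussed later in the section on answer retrieval, expressed as a hypothetical Python record (the field names are ours, not part of the dataset release):

```python
sample = {
    "question": "Who is working in automated regulatory compliance "
                "and has published a paper in NLP?",
    "mentions": [  # span annotations for entity linking
        {"ngram": "automated regulatory compliance",
         "entity_id": "e5", "type": "sub-area"},
        {"ngram": "NLP", "entity_id": "e6", "type": "keyword"},
    ],
    "paths": [["key person"], ["has paper", "author"]],  # ';'-separated in Figure 1
    "answer_entity_type": "researcher.name",             # AET label
    "question_type": "factoid",  # one of {simple, factoid, count, boolean}
}
```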

Related Work
In this section, we first present a view of prior work on the KGQA problem as an NLP task, and then on the set of techniques used for this task. Luo et al. (2018) proposed an approach to perform KGQA by mapping a query to its logical form and then converting it to a formal query to extract answers. However, these are not joint learning tasks as proposed in our work.
Multi-Task Based Approaches: Similar to us, many works like Lukovnikov et al. (2019) rely on jointly learning multiple sub-tasks of the KGQA problem. However, all these approaches focus on single-hop relations only, and therefore we cannot take such approaches as a baseline for our model. In a more complex setting, Shen et al. (2019) proposed a joint learning task for entity linking, path prediction (chain topology only), and question type. However, their model does not predict the answer entity type. We do not compare our approach with Shen et al. (2019) because they focus on implicit mentions of entities in previous sentences of a dialogue, and also because they do not attempt to predict non-chain topologies or the answer entity type.
Non-Chain Multi-Hop Relations: Agarwal et al. (2019) proposed an embedding-based approach for non-chain multi-hop relation prediction (for a fixed and small set of topologies). However, they perform only the single task of relation prediction.

Techniques used for KGQA
Transformers and Machine Translation: The Transformer Vaswani et al. (2017) has proved to be one of the most exciting approaches in NLP research, showing dominant results in Vaswani et al. (2017); Devlin et al. (2018), etc. Lukovnikov et al. (2019) closely resembles our approach, as they proposed a joint-learning based multi-task model using a Transformer. However, they handle only 1-hop questions and treat relation prediction as a classification task; in its current form, their model cannot be used to solve the variable-length path prediction problem required by our motivating example. Extending the line of work on logical forms for KGQA, Dong and Lapata (2016) proposed the use of an attention-based seq2seq model to generate the logical form of an input utterance; however, they use an LSTM model and not a Transformer.
Graph-Based Approaches: GraftNet Sun et al. (2018) performs KGQA by reasoning over a question-specific subgraph of the KG. Saxena et al. (2020) presented an approach, EmbedKGQA, for joint learning, again using KG embeddings, in the context of multi-hop relations. However, their approach is not truly a joint model, as they perform answer candidate selection outside the model, i.e., they arrive at the candidates before executing the model.
Our proposed approach outperforms PullNet and EmbedKGQA on the MetaQA dataset, as shown in Section 5.

Proposed Architecture
In this section, we describe our proposed joint model (CQA-NMT), which is an encoder-decoder architecture. Figure 3 illustrates a high-level view of the proposed model.
Joint Model for KGQA: In this paper, we extend BERT to jointly generate paths (or inference chains), perform sequence labeling, and perform classification. Details of each module are described next.

1. Entity Mention Detection Module:
To extract the mentioned entity(s) from the NL query, we perform a sequence labeling task using BERT's hidden states (Figure 4). Sequence labeling is a seq2seq task that tags the input word sequence x = (w_1, w_2, ..., w_T) with the output label sequence y_seq = (y_1, y_2, ..., y_T). In this paper, we augment CQA-NMT to jointly infer the type of the mentioned entity(s) along with its (their) span. We feed the final hidden states of the tokens, h_2, h_3, ..., h_{T-1}, into a softmax layer to generate the output sequence. We ignore h_1 and h_T, i.e., the [CLS] and [SEP] tokens, as they can never be part of an entity and are only required as a pre-processing step of BERT. Since BERT uses WordPiece tokenization, we assign to the later sub-tokens of a word the same label as its first sub-token. For example, the output of BERT's WordPiece tokenizer for the input 'Jim Henson' is 'Jim Hen ##son'; we assign the labels 'B-Per I-Per I-Per', i.e., the second sub-word '##son' is given the same label as the first sub-word 'Hen'. The output of the softmax layer is:

y_i^etype = softmax(W_etype · h_i + b_etype)    (1)

where h_i is the hidden state corresponding to the i-th token.

2. Entity Linking: The output of the Entity Mention Detection Module is a sequence of tokens along with the type (t_i) of each candidate entity. These mentioned entities still need to be linked to a KG node for traversal. In our work, we do not use any neural network for the linking process. Instead, we rely on an ensemble of string matching algorithms² and PageRank Page et al. (1999) to break ties between candidate entities. The Entity Mention Detection Module outputs as many entities as are provided in a query, together with their associated type (t_i). To link a mentioned entity in the NL query, we extract the candidates from V(G) of type t_i. We then apply three string-matching algorithms, similar to Mohammed et al. (2017), and take a majority vote; we apply the PageRank algorithm to break any remaining ties and link the mentioned entity with a KG entity. One way to understand the usefulness of the PageRank algorithm is to consider the notion of popularity: if a user queries 'Where was Obama born?', the user is more likely referring to the famous Barack Obama than to anyone else with that name. A detailed description of the entity mention detection and entity linking procedure is shown in Figure 4.

3. Path Prediction Module: To generate the sequence of predicates for an input query, we augment our architecture with a Transformer-based Vaswani et al. (2017) decoder, as often used in Neural Machine Translation (NMT) tasks. We define y_path = {p_1, p_2, ..., p_N}, where each p_i ∈ E(G). In our work, we do not constrain the number of predicates (multiple hops) required to extract the final answer. Hence, an obvious choice is a decoder module that stops generating predicates once it has predicted the end-of-sentence ([EOS]) token (Figure 4).

4. Question Type and Answer Entity Type Prediction Module: We formulate the tasks of determining the question type and the AET as classification tasks, since we have a discrete label set for both q_type and the answer entity types.

² We used the Levenshtein distance and SequenceMatcher packages available in Python.
Using the hidden state of the first special token from BERT, i.e., [CLS], we predict the question type and the AET:

y_qtype = softmax(W_qtype · h_1 + b_qtype)    (2)
y_τ = softmax(W_τ · h_1 + b_τ)    (3)

To jointly model all the tasks using a single architecture, we define our training objective as:

p(y|x) = p(y_etype, y_path, y_qtype, y_τ | x)    (4)
p(y|x) = p(y_qtype | x) · p(y_τ | x) · p(y_etype | x) · p(y_path | x)    (5)

The path component of CQA-NMT is defined as:

p(y_path | x) = ∏_{t=1}^{T} p(p_t | p_1, p_2, ..., p_{t-1}, x)    (7)

where

y_qtype ∈ {factoid, count, boolean, simple}    (8)
y_τ ∈ {entity types in KG}    (9)

For training, we maximize the conditional probability p(y_etype, y_path, y_qtype, y_τ | x). The model is fine-tuned end-to-end by minimizing the cross-entropy loss.
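For concreteness, a minimal sketch of the encoder-side heads in Eqs. (1)-(3) is shown below. This is our own illustration in PyTorch with the HuggingFace transformers package (the paper's implementation used TensorFlow and Texar, and 'bert-base-uncased' stands in for the small uncased BERT used in the paper):

```python
import torch.nn as nn
from transformers import BertModel  # assumed dependency for this sketch

class EncoderHeads(nn.Module):
    """BERT encoder with a token-level tagging head (Eq. 1) and two
    classification heads over the [CLS] state h_1 (Eqs. 2 and 3)."""

    def __init__(self, num_mention_tags, num_question_types, num_answer_types):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.mention_head = nn.Linear(hidden, num_mention_tags)  # W_etype, b_etype
        self.qtype_head = nn.Linear(hidden, num_question_types)  # W_qtype, b_qtype
        self.aet_head = nn.Linear(hidden, num_answer_types)      # W_tau, b_tau

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                      # (batch, T, hidden)
        return {
            "mention_logits": self.mention_head(h),    # softmax applied in the loss
            "qtype_logits": self.qtype_head(h[:, 0]),  # h_1 = [CLS]
            "aet_logits": self.aet_head(h[:, 0]),
            "encoder_states": h,                       # consumed by the path decoder
        }
```

The [CLS] and [SEP] positions can simply be masked out of the tagging loss, matching the choice above to ignore h_1 and h_T for mention detection.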
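The non-neural linking step can be sketched as follows. This is a simplified stand-in under our own assumptions: difflib.SequenceMatcher replaces the paper's majority vote over three string matchers, and networkx.pagerank is used for the popularity-based tie-break:

```python
from difflib import SequenceMatcher
import networkx as nx

def link_mention(mention, candidates, kg):
    """Link one detected mention to a KG node.

    mention:    surface string tagged by the mention detection module
    candidates: names of KG nodes of the predicted type t_i
    kg:         the knowledge graph as a networkx graph
    """
    # String similarity between the mention and every candidate of type t_i.
    scores = {c: SequenceMatcher(None, mention.lower(), c.lower()).ratio()
              for c in candidates}
    best = max(scores.values())
    tied = [c for c, s in scores.items() if s == best]
    if len(tied) == 1:
        return tied[0]
    # Tie-break by popularity: prefer the candidate with the higher PageRank,
    # e.g., the famous Barack Obama over namesakes.
    pagerank = nx.pagerank(kg)
    return max(tied, key=lambda c: pagerank.get(c, 0.0))
```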
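Finally, at inference time the decoder emits predicates until [EOS]. A greedy decoding loop over the predicate vocabulary could look like the sketch below, where decoder_step is a placeholder for one pass through the Transformer decoder with cross-attention to the encoder states:

```python
import torch

def decode_path(decoder_step, encoder_states, bos_id, eos_id, max_hops=10):
    """Greedily generate the predicate sequence p_1, ..., p_T of Eq. (7).

    decoder_step(prefix_ids, encoder_states) returns logits over the
    predicate vocabulary for every position of the prefix.
    """
    prefix, path = [bos_id], []
    for _ in range(max_hops):
        logits = decoder_step(torch.tensor([prefix]), encoder_states)
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:      # stop once [EOS] is predicted
            break
        path.append(next_id)       # a predicate id from E(G)
        prefix.append(next_id)
    return path
```

During training, the factorization of Eq. (5) corresponds to simply summing the cross-entropy losses of the four heads, with the decoder teacher-forced on the gold predicate sequence.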

Experiments and System details
In this section, we first introduce the datasets used for our experiments. We pre-process all NLQs (of all datasets) by downcasing and tokenizing.

Datasets, Metrics, and Baselines
LOCA Dataset: We introduce a new challenging dataset, LOCA, which consists of 5,010 entities, 42 unique predicates, and a total of 45,869 facts. The dataset has 3,275 one- or multi-hop questions that have 0, 1, or more entities mentioned in the question. It contains multiple question types such as count, factoid, and boolean. For questions with multiple entities, we used the operator ";" as a delimiter to separate the paths corresponding to each entity (Figure 1, queries 5, 7, and 8). For the scope of this paper, we considered only queries involving intersection; this operator can be replaced with other operators like union, set-difference, etc., without loss of generality. The operator ";" helps us detect and predict the different topologies involved in an NLQ.
MetaQA: The dataset proposed in Zhang et al. (2018) consists of 3 different versions, namely Vanilla, NTM, and Audio. All versions contain single- and multi-hop (maximum 3-hop) questions from the movie domain. For our experiments, we used the Vanilla and NTM versions of the dataset and the KB as provided in Zhang et al. (2018). Since both versions of MetaQA consider neither the AET nor the question type, we assigned a default label for both tasks.
Metrics: We use different metrics for different subtasks. Since a query can contain partially mentioned entities, we use the F-score to evaluate the mention and mention-type detection module. For inference chain (or path) prediction, question type prediction, and answer entity type prediction, we use accuracy. In Table 2, similar to prior works, we use Hits@1 to evaluate query-answer accuracy.
Baselines

Training Details
All the baselines and the proposed approach were trained on a DGX 32GB NVIDIA GPU using TensorFlow Abadi et al. (2015) and Texar Hu et al. (2018) libraries. For CQA-NMT, we used the small uncased version of the pre-trained BERT Devlin et al. (2018) model. The Adam Kingma and Ba (2014) optimizer was employed with a learning rate of 2e-5 for BERT and the default for the other components. The training objective of each model was optimized using the cross-entropy loss, and the best models were selected using the validation loss. Dropout values were set to 0.5 and were optimized as described in Srivastava et al. (2014). For BERT, we used 10% of the total training data for the warmup phase Vaswani et al. (2017). Finally, for the division of the dataset into train, test, and dev, we used the split provided by Zhang et al. (2018) for the MetaQA dataset and an 80-10-10 ratio for the LOCA dataset.
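A hedged sketch of this two-rate optimizer setup, written in PyTorch for brevity (the paper's implementation used TensorFlow, and the 1e-3 default below is our assumption; the paper only says 'default'):

```python
import torch

def build_optimizer(model, bert_lr=2e-5, default_lr=1e-3):
    """Adam with a learning rate of 2e-5 for BERT parameters and a
    default rate for all other parameters (default_lr is an assumption)."""
    bert_params, other_params = [], []
    for name, param in model.named_parameters():
        (bert_params if name.startswith("bert.") else other_params).append(param)
    return torch.optim.Adam([
        {"params": bert_params, "lr": bert_lr},
        {"params": other_params, "lr": default_lr},
    ])
```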

Main Results
In this section, we report the results of the experiments on the MetaQA and LOCA datasets. Next, we provide insights into the model outputs and the results of the error analysis performed on the LOCA dataset.

LOCA
The experimental results for the LOCA dataset are shown in the last row of Table 2. The results affirm that the proposed approach outperforms the baselines. We observed that the baselines' inability to handle duplicate KG entities (Section 2, challenge 4) limits their performance. Additionally, the ability of the NMT Bahdanau et al. (2014) model to effectively handle complex and unknown topologies helped us retrieve answers with better accuracy for variable-hop (v-hop) queries.

MetaQA
The experimental results for MetaQA are shown in Table 2. For Vanilla MetaQA, we achieve better answer accuracy in the 1-hop and 3-hop settings, with an increase of about 2% and 4.9% Hits@1, respectively. In the 2-hop setting, we achieve results comparable to the state-of-the-art.
To obtain the performance of each baseline on the v-hop (variable-hop) dataset, we re-use the existing models for 1-hop, 2-hop, and 3-hop and assume that there is an oracle which can redirect each query to the correct model. The accuracy of various approaches thus estimated is shown in the 4th row of Table 2, while the actual results on the v-hop dataset are shown in the 5th row. It is evident that CQA-NMT outperforms all the baselines on the MetaQA dataset in the variable-hop setting.
To gauge the effectiveness and robustness of our model, we took the same models trained on the Vanilla MetaQA dataset and evaluated their performance on NTM MetaQA, i.e., in a zero-shot setting. Here, we achieve better results in the 1-hop and 3-hop settings. The worse performance of CQA-NMT on MetaQA-NTM (2-hop) can be attributed to the zero-shot setting: unlike VRN, we did not train CQA-NMT on the MetaQA-NTM dataset, but only on the Vanilla MetaQA dataset.

Table 3: Effects of reducing the supervision in our approach. The numbers in italics are obtained without any supervision.

Further Results and Analysis
Advantage of Transformers: The LSTM-based implementation of mention detection could not detect different entity types for the same phrase "deep learning" in queries 1 and 2 of Figure 1, whereas the BERT-based approach could. We infer that this could be due to key features of BERT such as multi-head attention, WordPiece embeddings, positional embeddings, and/or segment embeddings. Moreover, in a different context, it was able to assign different types to entities with the same mentions (queries 1 and 2 of Figure 1).

Effects of using fewer annotations: To study the importance of annotation in our approach, we removed several components from our proposed approach and studied the effects (Table 3). We first studied CQA-NMT after removing all the supervision, using heuristics-based approaches for AET and mention detection (both taken from Mohammed et al. (2017)). The shortest path between the linked KG entity and the AET, similar to Sun et al. (2018, 2019), was then taken to retrieve the answers. This setting (row 1) results in the worst performance. In rows 2, 3, and 4 of Table 3, we kept only one component of CQA-NMT supervised and applied the heuristics mentioned above for the others. As evident from these rows, mention detection plays a crucial role in extracting the correct answer (a jump in the range of 2%-5% in answer accuracy). A similar analysis can be found in Dong and Lapata (2016). From Table 3, we infer that joint training not only improves the scores of the individual components (in the range of 15%-20%) but also the overall answer accuracy. We observed that challenges 5 and 6 from Section 2 were handled significantly better after jointly training CQA-NMT for AET and mention detection (row 5).

Motivation for PageRank: When we have more than one candidate entity for a mentioned entity, we want to choose the one with higher popularity (Section 4). One of the most well-established measures of the popularity of nodes in a graph is PageRank, which is why we use it. Further, when more than one entity is mentioned in an NLQ, there can be more than one candidate entity for each of them; the graph-based approach also helps us choose candidates that are well connected. We also experimented with other measures such as the in-degree and out-degree of nodes. However, for the LOCA dataset, we achieved an increment of 22% on the entity linking task using PageRank, as compared to the in-degree and out-degree measures. PageRank also helped in reducing challenges 1-2 from Section 2.

Retrieval of answer(s) from KG
The final objective of a KGQA system is to retrieve the correct answer from the KG for a query q. To this end, we use the outputs of the different components of CQA-NMT and feed them to complete pre-written SPARQL sketches. We defined a set of rules for the different question types and used simple mapping rules to map the queries to the sketches. For example, consider the query q = "Who is working in automated regulatory compliance and has published a paper in NLP?". The output of CQA-NMT contains all the information required to form a structured query such as SPARQL:
1. Linked Entities: {e5: automated regulatory compliance (sub-area), e6: NLP (keyword)}
2. Inference Chain: key person; has paper, author
3. Answer Entity Type (AET): researcher.name
4. Question Type (q_type): Factoid
Using the q_type information, we select a sketch and fill it using the other outputs. The generated SPARQL query is: SELECT DISTINCT ?uri WHERE {<e5> <key person> ?uri . <e6> <has paper> ?x . ?x <author> ?uri}, where e5 and e6 are the unique identities assigned to 'automated regulatory compliance' (of type sub-area) and NLP (of type keyword).
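A minimal sketch of this template-filling step is shown below; the sketch strings and the fill_sketch helper are our own illustration of the rule-based mapping, not the system's actual rule set:

```python
SKETCHES = {
    # One illustrative sketch per question type; the real system keeps
    # several pre-written sketches, selected by simple mapping rules.
    "factoid": "SELECT DISTINCT ?uri WHERE {{ {triples} }}",
    "count": "SELECT (COUNT(DISTINCT ?uri) AS ?c) WHERE {{ {triples} }}",
    "boolean": "ASK WHERE {{ {triples} }}",
}

def fill_sketch(question_type, linked_entities, paths):
    """linked_entities: one KG id per path; paths: predicate lists per entity."""
    triples, fresh = [], 0
    for entity, path in zip(linked_entities, paths):
        subject = f"<{entity}>"
        for i, predicate in enumerate(path):
            if i == len(path) - 1:
                obj = "?uri"            # every path ends at the answer variable
            else:
                obj = f"?x{fresh}"      # fresh variable for intermediate hops
                fresh += 1
            triples.append(f"{subject} <{predicate}> {obj} .")
            subject = obj
    return SKETCHES[question_type].format(triples=" ".join(triples))

# fill_sketch("factoid", ["e5", "e6"], [["key person"], ["has paper", "author"]])
# -> SELECT DISTINCT ?uri WHERE { <e5> <key person> ?uri .
#    <e6> <has paper> ?x0 . ?x0 <author> ?uri . }
```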

Conclusion
We presented a complex version of the KGQA problem, which involves the mention of multiple entities in the question; multiple sequences of relationships, combined in complex topologies, are required to answer such questions. Such questions, while required to be answered in real-world industrial settings, cannot be answered using prior approaches. We proposed a novel model, CQA-NMT, to answer such questions and performed a detailed comparison of our approach with prior art on the MetaQA and LOCA datasets. We showed that CQA-NMT not only solves a more complex task but also performs better on the MetaQA dataset compared to baseline approaches.