Knowledge Informed Semantic Parsing for Conversational Question Answering

Smart assistants are tasked with answering a wide range of questions about world knowledge. These questions range from retrieval of simple facts to complex, multi-hop questions followed by various operators (e.g., filter, argmax). Semantic parsing has emerged as the state of the art for answering such questions by forming queries that extract information from knowledge bases (KBs). Specifically, neural semantic parsers (NSPs) translate natural language questions into logical forms, which are executed on the KB to yield the desired answers. Yet NSPs suffer from non-executable logical forms when instances referenced in the generated logical forms are missing due to the incompleteness of KBs. Intuitively, knowing the KB structure informs the NSP of how the global structure of a logical form should change with respect to changes in KB instances. In this work, we propose a novel knowledge-informed decoder variant of NSP. We consider the conversational question answering setting, where a natural language query, its context, and its final answers are available at training time. Experimental results show that our method outperforms strong baselines by 1.8 F1 points overall across the 10 question types of the CSQA dataset. On the "Logical Reasoning" category in particular, our model improves by 7 F1 points. Furthermore, these results are achieved with 90.3% fewer parameters, allowing faster training on large-scale datasets.


Introduction
Knowledge base question answering (KBQA) has emerged as an important research topic over the past few years (Sun et al., 2018; Chakraborty et al., 2019; Sun et al., 2019), alongside question answering over text corpora. In KBQA, world knowledge is given in the form of multi-relational graph databases (Lehmann et al., 2015) with millions of entities and interrelations between them. When a natural language question arrives, KBQA systems analyse relevant facts in the knowledge base and derive the answers. In the presence of knowledge bases, question answering results are often more interpretable and modifiable. For example, the question "Who started his career at Manchester United in 1992?" can be answered by fact triples such as ("David Beckham", member of sports team, "Manchester United"). This fact can be updated as world knowledge changes, while it might be non-trivial to achieve the same effect on text corpora. Nonetheless, KBQA systems face their own challenges (Chakraborty et al., 2019), especially in real-world, conversational settings.
In real-world settings, KBQA systems need to perform multi-hop reasoning over chains of supporting facts and carry out various operations within the context of a conversation. For instance, answering the follow-up question "When did he win his first championship?" might require identifying the player previously mentioned, all of his sports teams, and the dates those teams won their championships. Then, argmax and filter operators are applied on the returned dates, yielding the answer, i.e., "1999" for "David Beckham". Semantic parsing provides a weak supervision framework to learn to perform all these reasoning steps from just the question-answer pairs. Semantic parsers define a set of rules (or grammar) for generating logical forms from natural language questions. Candidate logical forms are executable queries on the knowledge base that yield the corresponding answers. Neural semantic parsers (NSPs) (Liang et al., 2016; Guo et al., 2018) employ a neural network to translate natural language questions into logical forms. NSPs have shown good performance on KBQA tasks (Liang et al., 2016) and have been further improved with reinforcement learning (Guo et al., 2018), multi-task learning, and most recently meta-learning (Hua et al., 2020). Most previous works place more emphasis on modeling the reasoning behavior expressed in the questions than on interactions with the KB. In this work, we propose a KB-aware NSP variant (KISP) to fill this gap.
One of the main challenges in learning KBQA systems is adapting to structural changes of the relevant sub-knowledge base. Different reasoning behaviors might apply to similar questions with respect to different sub-knowledge bases. For example, the similar question "When did Tiger Woods win his first championship?" would require a different reasoning chain, since he did not participate in a sports team. Structural change of the sub-KB is a common phenomenon due to the incomplete nature of knowledge bases. In such cases, knowing the attributes and relations would inform NSPs of changes in logical forms with respect to the specific relevant KB entities. To address this problem, we propose an NSP with a KB-informed decoder that utilizes local knowledge base structure encoded in pre-trained KB embeddings. Our model collects all relevant KB artifacts and integrates their embeddings into each decoding step, iteratively. We also introduce an attention layer over a set of associated KB random walks as a k-step look-ahead that prevents the decoder from entering KB regions where generated logical forms are not executable.
Pre-trained KB embeddings were shown to improve multi-hop KBQA where answers are entities and no operations are involved (Saxena et al., 2020). However, Saxena et al. (2020) evaluate only on 2-hop questions (Yih et al., 2016) and on 2- and 3-hop questions with limited relation types (Zhang et al., 2018); in this paper, we demonstrate our approach in the full KBQA setting, with 10 question categories and no constraints on the answers (Saha et al., 2018). Our model is also the first NSP variant that utilizes pre-trained features for logical form generation. CARTON uses an updated action grammar with stacked pointer networks. LASAGNE is an extension of CARTON that further includes a graph attention network to exploit correlations between entities and predicates. Empirical results show that our model improves upon the MaSP model, a strong baseline for the CSQA dataset, by an absolute 1.8 F1 and 1.5% accuracy on the two sets of questions, respectively. Further, we find that by incorporating knowledge-graph information we can match the performance of much larger pre-trained encoder models while using 90.3% fewer parameters.

Background
We first formally describe our task and the Neural Semantic Parser (NSP) on which our work is based.
Knowledge Graph: Let E = {e_0, ..., e_N} be a set of given entities, and let R = {r_0, ..., r_M} be a set of relations. A knowledge graph G is a set of fact triples in E × R × E. A triple is represented as (h, r, t), where h, t ∈ E and r ∈ R. There is an extensive literature on knowledge graph representations (Ji et al., 2020; Dai et al., 2020) that encode its semantics and structure. In this work, we use the pre-trained knowledge graph embeddings from PyTorch-BigGraph (Lerer et al., 2019).
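To make the notation concrete, the following is a minimal sketch (not the paper's actual pipeline): a toy KG stored as (h, r, t) triples, with embeddings looked up from a plain dict standing in for pre-trained vectors such as those exported from PyTorch-BigGraph. All names and vectors are illustrative.

```python
# Hypothetical toy KG; entity/relation names and embedding dim are made up.
from typing import Dict, List, Tuple
import random

Triple = Tuple[str, str, str]

kg: List[Triple] = [
    ("David_Beckham", "member_of_sports_team", "Manchester_United"),
    ("Manchester_United", "instance_of", "sports_team"),
]

# Stand-in for pre-trained KG embeddings: random 4-dim vectors keyed by name.
random.seed(0)
emb: Dict[str, List[float]] = {
    name: [random.random() for _ in range(4)]
    for name in {x for triple in kg for x in triple}
}

def neighbors(h: str, r: str) -> List[str]:
    """All tails t such that (h, r, t) is a fact in the KG."""
    return [t for (h2, r2, t) in kg if h2 == h and r2 == r]
```

Real KBs like Wikidata hold millions of such triples; the dict lookup here would be replaced by the pre-trained embedding table.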
Conversational Question Answering: In conversational question answering (CQA), the goal is to answer a question q within the context of the conversation history C. The question q and the history C are usually concatenated to handle ellipsis and coreference, forming the input X as [C; q]. At training time, a set of answering entities A is also given. The set A comprises entities that resolve to the answer, depending on the answer's type. For example, the answers to a "Simple Question" are a list of entities, while the answer to a "Verification Question" is Yes/No, depending on whether the set A is empty or not.

Neural Semantic Parser
The semantic parsing approach to CQA produces the answer set A by first generating a logical form Y. Formally, a logical form Y is a sequence of actions (y_1, y_2, ..., y_n), where the arguments of these actions can be constants (i.e., numbers, dates) or KG instances (i.e., entities, relations, types). The set of actions is defined by a grammar S. We consider the weak-supervision setting where the ground-truth logical form Y is not available. Instead, we generate candidates for Y by performing breadth-first search (BFS) based on grammar S over the knowledge graph G and keeping the candidate logical forms that yield the answer set A (Guo et al., 2018). Given the input X and the labeled logical form Y, we train an encoder-decoder neural network to generate logical forms given the question and its conversational context.
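The candidate search just described can be sketched as follows, under heavy simplification: a toy one-action grammar (a single `find` operator) is enumerated breadth-first over a tiny KG, and only programs whose execution equals the gold answer set A are kept. The grammar, KG, and program depth are assumptions for illustration; the paper's actual action set is richer.

```python
# Hedged sketch of weak-supervision candidate search over a toy grammar.
from itertools import product

kg = {("Beckham", "member_of", "Man_United"),
      ("Giggs", "member_of", "Man_United"),
      ("Beckham", "citizen_of", "England")}
entities = {h for h, _, _ in kg} | {t for _, _, t in kg}
relations = {r for _, r, _ in kg}

def find(e, r):
    """Execute the program (find e r): all heads h with (h, r, e) in the KG."""
    return {h for (h, r2, t) in kg if r2 == r and t == e}

def search(answer):
    """Depth-1 BFS frontier: keep programs whose execution yields `answer`."""
    hits = []
    for e, r in product(entities, relations):
        if find(e, r) == answer:
            hits.append(("find", e, r))
    return hits

programs = search({"Beckham", "Giggs"})
```

In the full setting the frontier is expanded to deeper programs (filters, counts, argmax) and pruned by the grammar; the same execute-and-compare check against A supplies the training labels.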

Encoder:
The input X is formatted in BERT style. It is then fed into a Transformer-based encoder network ENC, producing a sequence of encoded states H = ENC(X) = (h_[CLS], h_0, ...).
Decoder: The decoder is a Transformer-based model with attention. It takes the input representation from the encoder h [CLS] and the previous decoding state s i−1 to produce the target action y i .
Classifiers: The decoder is accompanied by a set of classifiers that predict the arguments for the decoder's actions at each decoding step. Our base NSP employs FFNNs as relation and entity-type classifiers, and pointer networks for entities and constants mentioned in the question. At each decoding step, these classifiers produce an entity e_i, an entity type t_i, a relation r_i, and a constant c_i. The logical form action at time step i is a tuple consisting of y_i and its arguments within {e_i, t_i, r_i, c_i}, as defined by the grammar S.
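The per-step argument prediction can be sketched like this: small linear heads score the relation and type vocabularies from the decoder state, while a pointer head scores the encoder's token states by dot product with the state. All weights, dimensions, and vocabularies below are made up for illustration; they are not the paper's parameters.

```python
# Illustrative sketch of the decoder's argument classifiers (toy weights).
import random
random.seed(1)

D = 8                                   # assumed hidden dimension
relations = ["member_of", "citizen_of"]
types = ["person", "sports_team"]

def linear(x, W):
    """Apply a weight matrix W (out x in) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def argmax(scores, labels):
    return labels[max(range(len(scores)), key=scores.__getitem__)]

W_rel = [[random.uniform(-1, 1) for _ in range(D)] for _ in relations]
W_typ = [[random.uniform(-1, 1) for _ in range(D)] for _ in types]

def classify_step(state, token_states, tokens):
    r = argmax(linear(state, W_rel), relations)      # relation head (FFNN)
    t = argmax(linear(state, W_typ), types)          # type head (FFNN)
    ptr = [sum(s * h for s, h in zip(state, h_tok))  # pointer: dot product
           for h_tok in token_states]
    e = argmax(ptr, tokens)                          # pointed-to entity mention
    return e, t, r
```

The grammar S then decides which of the predicted arguments the action y_i actually consumes at that step.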

Knowledge-Informed Decoder
In this section, we introduce a knowledge-informed decoder that utilizes KG information to generate logical forms. We propose a knowledge injection layer that incorporates KG embeddings into the decoder state at each decoding step. To further inform the decoder about the expected structure of the KG, we propose an attention layer over random k-hop knowledge walks from the entities encountered at each decoding step.

Knowledge Injection Layer (KIL)
NSP decoders only look at the encoded question and the previous decoding state to decide the next action. Information about the KB instances (i.e., entities, types, or relations) considered so far could improve this decision-making process. Therefore, at each decoding step i where the action involves a KB instance, we propose a Knowledge Injection Layer (KIL) to propagate KB information to subsequent steps. KIL takes the KB classifiers' predictions, incorporates their embeddings into the current encoding state, and forwards it to the next decoding step; the decoder update of Eq. 1 thus additionally conditions on the embedding of v_{i-1}, the corresponding argument of y_{i-1}. At any step j > i, the decoder is then informed of the preceding KB instances and is able to adapt to the specific sub-KB structure. We find that in cases with multiple entities in context, having the right entity embedding at time step j helps logical form prediction in the upcoming steps. The entity embedding carries information about the type of the entity, which helps our model select more appropriate predicates for ambiguous mentions. We empirically show that KIL improves the exact-match accuracy of the logical form attributes (the logical form without KB instances).
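A minimal sketch of one plausible reading of KIL: the KG embedding of the previously predicted argument is concatenated with the decoder input and projected back to the decoder dimension, so later steps see which KB instance was chosen. The concat-then-project fusion, dimensions, and weights are assumptions, not the paper's exact formulation.

```python
# Hypothetical Knowledge Injection Layer: fuse a KG embedding into the state.
import random
random.seed(2)

D, K = 6, 4                 # assumed decoder dim and KG-embedding dim
W = [[random.uniform(-1, 1) for _ in range(D + K)] for _ in range(D)]

def kil(state, kg_emb):
    """Project the concatenation [state; kg_emb] back to the decoder dim."""
    x = state + kg_emb      # list concatenation = vector concatenation here
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

fused = kil([0.1] * D, [0.5] * K)   # state now carries the KB instance signal
```

At steps with no KB-instance argument, the layer can simply pass the state through unchanged (e.g., with a zero KG vector).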

Attention on KG Walks (AKW)
Now that the decoder is aware of the previous KB instances, it is also useful to peek at the possible reasoning chains coming out of the current decoding state. We do this to avoid reasoning paths that lead to a non-executable region where the logical form is invalid with respect to the KB. Therefore, we propose an attention look-ahead layer to inspect the upcoming KB structures before making the action prediction. We first generate a set of random walks on the KG from the entities and relations predicted at the current decoding step. We then apply the attention look-ahead layer on these KG walks to obtain a representation of the expected KG structures. This representation is then fed back to the decoder to predict the action.
The look-ahead attends over the walk representations, where v is one of the entities in the question and p_j is a random walk path on the KB starting from v, drawn from G(v). Here we use one-hop random walks from predicates found in the input, though any type of random walk could be used. With the two proposed layers, our NSP decoder is fully informed of the past and the expected future KB structure. We demonstrate that our decoder variant achieves better performance across various question categories. Furthermore, we show that the pre-trained KG embeddings do significant heavy lifting in representing KB information within the decoder states, resulting in fewer model parameters and less required training data.
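The two AKW ingredients can be sketched as follows: sampling short random walks from an entity in context, and attending over their representations with the decoder state as the query. Walk representations here are simple means of step embeddings, and softmax attention is a standard choice; both are assumptions about the layer, and the toy graph and embeddings are made up.

```python
# Sketch of Attention on KG Walks: sample walks, then attend over them.
import math, random
random.seed(3)

edges = {"Beckham": [("member_of", "Man_United"), ("citizen_of", "England")]}
emb = {k: [random.random() for _ in range(4)]
       for k in ["member_of", "citizen_of", "Man_United", "England"]}

def random_walks(v, n=4, k=1):
    """Sample n random k-hop walks from v as lists of (relation, node) steps."""
    walks = []
    for _ in range(n):
        node, walk = v, []
        for _ in range(k):
            r, nxt = random.choice(edges.get(node, [("self", node)]))
            walk.append((r, nxt))
            node = nxt
        walks.append(walk)
    return walks

def walk_rep(walk):
    """Represent a walk as the mean of its relation and node embeddings."""
    vecs = [emb[r] for r, _ in walk] + [emb[t] for _, t in walk]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def attend(query, walks):
    """Softmax attention over walk representations, queried by the state."""
    reps = [walk_rep(w) for w in walks]
    scores = [sum(q * x for q, x in zip(query, rep)) for rep in reps]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]
    z = sum(ws)
    return [sum(wi * rep[i] for wi, rep in zip(ws, reps)) / z
            for i in range(len(query))]
```

The attended vector summarizes where the KB allows the reasoning chain to go next, and is fed back into the decoder before action prediction.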


Experiments
Dataset and Evaluation We evaluate our approach on the Complex Sequential Question Answering (CSQA) dataset. CSQA consists of 1.6M question-answer pairs spread across 200K dialogues. It has a 152K/16K/28K train/val/test split. More details on the dataset and the evaluation metrics used are presented in Section A.1 of the Appendix.

Main Results
Our model (code: https://github.com/raghavlite/kisp) outperforms the MaSP model by 1.8 absolute points in F1-score on entity-answer questions and 1.5 absolute points in accuracy on the boolean/counting categories. Table 1 compares KISP with MaSP, showing significant improvements. On more complex question types such as 'Logical Reasoning' and 'Verification', which require reasoning over multiple tuples in the KG, and on questions requiring operations like counting, our model outperforms the baseline by more than 10 points. Additional analysis is provided in the Appendix. Our model also beats CARTON on the entity-answer questions, despite CARTON's updated action grammar. For boolean and count question types, the additional action vocabulary helps CARTON outperform our system. We will extend KISP to use this additional action vocabulary in the future.

Ablation Study
KG-informed decoding with small models. A significant performance gain is expected in smaller models from the use of knowledge graph information. We test this hypothesis by drastically reducing the size of the KISP encoder. This small version of KISP, with only 9.7% of the baseline's parameters, slightly outperforms the baseline BERT model on overall F1-score. The gain comes from the fact that our model receives a significant signal from KIL to make a more informed decision about valid actions/types in the next step, even without much knowledge from the encoder attention. Low resource settings. A semantic parsing system as described above typically requires annotated golden logical forms for training. Logical form annotation is a resource-intensive process (Berant et al., 2013; Zhong et al., 2017).
It is also difficult to find these logical forms by brute-force computation, and this process often yields spurious logical forms. This calls for models that can work with very few training examples. Hence we evaluate the effectiveness of KISP in low-resource settings where only a fraction of the data is used for training. Table 3 shows that KISP outperforms MaSP in these data-constrained cases. The gap between MaSP and KISP widens in these low-resource settings, further justifying our model. Impact of KIL and AKW. To further understand how each classifier on the decoder benefits from the knowledge graph, we look at the accuracies of these classifiers on the evaluation set. Table 4 displays the accuracies of the five classifiers from Eq. 1 involved in logical form generation for the different models.

KISP does a better job at predicting the overall skeleton of the logical form, i.e., all actions other than e_i, t_i, r_i, c_i. We observe that attending to the knowledge graph improves the logical form skeleton by up to 2.3 points. As shown in Examples 3 and 4 of the Appendix, the count and filter actions within the logical form are better predicted by KISP. KIL provides the embedding for the entity of interest at the current time step, which helps the model pick the right predicates in the following steps in ambiguous cases. Cases requiring reasoning benefit from seeing random walks around entities in context, provided by AKW. Together, these lead to better overall sketch accuracy.
KISP is also better at pointing to the correct entity. Pointing to the right entity can have cascading effects on logical form prediction, as shown by the numbers in Table 4. KISP's entity pointer improves by almost 4 points. We attribute this to the KIL component of KISP, which provides the KG embedding for the entity of interest at a given time step and thereby helps the decoder's entity pointer mechanism.
Entity Linking Errors We follow Sheang (2019) in using a joint mention and type classifier followed by an inverse-index entity linker over the input, using the encoder representations. The entity pointer classifier described in earlier sections looks at these entities in a sentence and points to one of them. We found that a large portion of errors arose from this inverse index. Recent work also points this out and uses a better entity linker. Improving this module should significantly add to final performance and is hence an interesting direction for future work.

Conclusion
We introduced a neural semantic parsing decoder that uses additional knowledge graph information for conversational QA. Results show that KISP can significantly boost performance on complex multi-hop question types such as logical reasoning questions, and our method improves over strong baseline methods like MaSP. Finally, we presented a smaller version of our model that is approximately 10x smaller without any performance degradation compared to a system that does not use KG-informed decoding.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

A.1 Dataset and Evaluation
We evaluate our approach on the Complex Sequential Question Answering (CSQA) dataset. CSQA consists of 1.6M question-answer pairs spread across 200K dialogues. It has a 152K/16K/28K train/val/test split. The dataset's knowledge graph is built on Wikidata and represented as triples. The KB consists of 21.2M triples over 12.8M entities, 3,054 distinct entity types, and 567 distinct predicates. There are 10 different question categories split into two groups. Answers to the first group of questions are a list of entities; question categories in this group are evaluated by the macro F1 score between predicted entities and golden entities. Answers to question categories in the second group are either counts or boolean; this group is evaluated by accuracy. The overall score for each group is the weighted average of the metrics of all the categories in the group. We refer the reader to Saha et al. (2018) for a more detailed description of the different question categories. The following sections contain training/evaluation specifics.

A.2 Training details & Evaluation Metrics
We followed prior work to search for logical forms and create the training data. The exact hyperparameters used in the experiments are mentioned below. We followed Saha et al. (2018) for evaluation metrics. Macro precision and macro recall were used when the answer is a list of entities.
For questions with answer type boolean/number, we use accuracy.
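The two evaluation modes described above can be sketched as follows; per-example averaging of the set-level F1 is our assumption about "macro" here.

```python
# Hedged sketch of the evaluation metrics: set-level P/R/F1 and accuracy.
def prf1(pred, gold):
    """Precision, recall, and F1 between a predicted and a gold entity set."""
    if not pred or not gold:
        return (0.0, 0.0, 0.0)
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_f1(pairs):
    """Average per-example F1 over (predicted_set, gold_set) pairs."""
    return sum(prf1(p, g)[2] for p, g in pairs) / len(pairs)

def accuracy(pred, gold):
    """Fraction of boolean/count answers predicted exactly right."""
    return sum(a == b for a, b in zip(pred, gold)) / len(gold)
```

Per-category scores are then combined into the group-level score by a weighted average over category sizes.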

A.3 Training time Analysis
Training times of the different models are shown in the accompanying table. There are some known inefficiencies in the code, some from design and others conceptual. We intend to improve training time in future work by incorporating more end-to-end methods that reduce GPU-to-CPU and CPU-to-GPU communication, and through some design changes in the short term.

A.4 Logical form Analysis
We identify examples that illustrate the performance improvement of KISP models in predicting the correct answer and logical form. As shown in Table 6 below, on these examples the KISP models do a better job at sketch, entity, number, type, and predicate classification compared to MaSP. The colored images in Figures 2-6 show the differences between the MaSP and KISP models. For each example we show the golden logical form tree (also predicted by one of the KISP models), MaSP's logical form, and the mistakes made by the baseline in red.