Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks

This paper addresses the task of (complex) conversational question answering over a knowledge graph. For this task, we propose LASAGNE (muLti-task semAntic parSing with trAnsformer and Graph atteNtion nEtworks). It is the first approach that employs a transformer architecture extended with Graph Attention Networks for multi-task neural semantic parsing. LASAGNE uses a transformer model for generating the base logical forms, while the Graph Attention model is used to exploit correlations between (entity) types and predicates to produce node representations. LASAGNE also includes a novel entity recognition module which detects, links, and ranks all relevant entities in the question context. We evaluate LASAGNE on a standard dataset for complex sequential question answering, on which it outperforms existing baselines averaged over all question types. Specifically, we show that LASAGNE improves the F1-score on eight out of ten question types; in some cases, the increase is more than 20% compared to the state of the art (SotA).


Introduction
Since their inception in the late 2000s, publicly available Knowledge Graphs (e.g., DBpedia (Lehmann et al., 2015) and Yago (Suchanek et al., 2007)) have been widely used as a source of knowledge in several natural language processing (NLP) tasks such as entity linking, relation extraction, fact-checking, and question answering. Question answering (QA), in particular, is an essential task that maps a user's natural language question to a query over a knowledge graph (KG) to retrieve the correct answer (Singh et al., 2018). With the increasing popularity of intelligent personal assistants (e.g., Alexa, Siri), the research focus has shifted to conversational question answering, which involves multi-turn dialogues and incorporates the phenomena of anaphora and ellipsis (Christmann et al., 2019; Shen et al., 2019) (cf. Figure 1).
Conversational QA is often realised by using semantic parsing approaches, mapping an utterance to a logical form for extracting answers from a KG (Guo et al., 2018; Shen et al., 2019). The state of the art for semantic parsing approaches decomposes the semantic parsing process into two stages (Shen et al., 2019). First, a logical form is generated based on low-level features, and then the missing details are filled in by considering both the question and the template. Other approaches (Dong and Lapata, 2016; Guo et al., 2018; Liang et al., 2017) first employ an entity linking model to identify entities in the question and subsequently use another model to map the question to a logical form. Zhang et al. (2018) and Shen et al. (2019) point out that the modular approaches suffer from the common issue of error propagation along the QA pipeline, resulting in accumulated errors. To mitigate these errors, Shen et al. (2019) proposed a multi-task framework, where a pointer-equipped semantic parsing model was designed to resolve coreference in conversations and empower joint learning with a type-aware entity detection model. Furthermore, the authors used simple classifiers to predict the required (entity) types and predicates for the generated logical forms. In this paper, we argue that the model of Shen et al. (2019), the current SotA, has the following shortcomings: 1) the (entity) type and predicate classifiers share no common information, except for the supervision signal propagated to them. 2) Hence, due to the missing common information, the model can produce ambiguous results, since the classifiers can predict (entity) types and predicates that do not correlate with each other.

Approach and Contributions:
We tackle the problem of conversational (complex) question answering over a large-scale knowledge graph. We propose LASAGNE (muLti-task semAntic parSing with trAnsformer and Graph atteNtion nEtworks), a multi-task learning framework consisting of a transformer model extended with Graph Attention Networks (GATs) (Veličković et al., 2018) for multi-task neural semantic parsing. Our framework handles semantic parsing using the transformer model (Vaswani et al., 2017), similar to previous approaches. However, in LASAGNE we introduce the following two novel contributions: 1) the transformer model is supplemented with a Graph Attention Network, whose message passing between nodes allows it to exploit the correlations between (entity) types and predicates. 2) We propose a novel entity recognition module that detects, links, filters, and permutes all relevant entities. Shen et al. (2019) use a pointer-equipped decoder that learns and identifies the relevant entities for the logical form using only the encoder's information. In contrast, we use both sources of information, i.e., the entity detection module and the encoder, to filter and permute the relevant entities for a logical form. This avoids re-learning entity information in the current question context and relies on the entity detection module's information. Our empirical results show that the proposed novel contributions lead to substantial performance improvements.
LASAGNE achieves state-of-the-art results in 8 out of 10 question types on the Complex Sequential Question Answering (CSQA) dataset (Saha et al., 2018), which consists of conversations over linked QA pairs. The dataset contains 200K dialogues with 1.6M turns, and over 12.8M entities from Wikidata. Our implementation, the annotated dataset with the proposed grammar, and the results are publicly available to facilitate reproducibility and reuse.
The structure of the paper is as follows: Section 2 summarises the related work. Section 3 presents the proposed LASAGNE framework. Section 4 describes the experiments, including the experimental setup, the results, the ablation study and error analysis. We conclude in Section 5.

Related Work
We point to the survey by Gao et al. (2018), which provides a holistic overview of neural approaches in conversational AI. In this paper, we restrict ourselves to the closely related work, i.e., semantic parsing-based approaches in conversations. Liang et al. (2017) introduce a neural symbolic machine (NSM) extended with a key-value memory network, where keys and values are the output of the sequence model in different encoding or decoding steps. The NSM model is trained using the REINFORCE algorithm with weak supervision and evaluated on the WebQuestionsSP dataset (Yih et al., 2016). Saha et al. (2018) propose a hybrid model of the HRED model (Serban et al., 2016) and the key-value memory network model (Miller et al., 2016). The model consists of three components. The first one is the Hierarchical Encoder, which computes a representation for each utterance. The next module is a higher-level encoder that computes a representation for the context. The second component is the Key-Value Memory Network. It stores each of the candidate tuples as a key-value pair, where the key contains the concatenated embedding of the relation and the subject. In contrast, the value contains the embedding of the object. The last component is the decoder, used to create an end-to-end solution and produce multiple types of answers. Guo et al. (2018) present a model that converts an utterance in conversation to a logical form. The model follows a flexible grammar, in which the generation of a logical form is equivalent to predicting a sequence of actions. A dialogue memory management component is proposed and integrated into the model, so that historical entities, predicates, and action sub-sequences can selectively be replicated. Shen et al. (2019) proposed the first multi-task learning framework that learns type-aware entity detection and pointer-equipped logical form generation simultaneously. The multi-task learning framework takes advantage of the supervision from the subtasks.

Table 1: The complete grammar with all the defined actions.
Action | Description
set → find(e, p) | set of objects part of the triples with subject e and predicate p
set → find_reverse(e, p) | set of subjects part of the triples with object e and predicate p
set → filter_type(set, tp) | filter the given set of entities based on the given type
set → filter_multi_types(set1, set2) | filter the given set of entities based on the given set of types
dict → find_tuple_counts(p, tp1, tp2) | extracts a dictionary, where keys are entities of type1 and values are the number of objects of type2 related with p
dict → find_reverse_tuple_counts(p, tp1, tp2) | extracts a dictionary, where keys are entities of type1 and values are the number of subjects of type2 related with p
set → greater(dict, num) | set of those entities that have greater count than num
set → lesser(dict, num) | set of those entities that have lesser count than num
set → equal(dict, num) | set of those entities that have equal count with num
set → approx(dict, num) | set of those entities that have approximately same count with num
set → atmost(dict, num) | set of those entities that have at most same count with num
set → atleast(dict, num) | set of those entities that have at least same count with num
set → argmin(dict) | set of those entities that have the least count
set → argmax(dict) | set of those entities that have the most count
boolean → is_in(entity, set) | check if the entity is part of the set
number → count(set) | count the number of elements in the set
set → union(set1, set2) | union of set1 and set2
set → intersection(set1, set2) | intersection of set1 and set2
set → difference(set1, set2) | difference of set1 and set2

LASAGNE
In a conversation, the input data consists of utterances u and their answers a, extracted from the knowledge graph. Our framework LASAGNE employs a multi-task semantic parsing approach. In particular, it maps the utterance u to a logical form z, depending on the conversation context. Figure 2 shows the architecture of LASAGNE.

Grammar
For the semantic parsing task, we propose a grammar that can capture the entire context of the input utterance with the minimum number of actions. Table 1 illustrates the complete grammar with all the defined actions. We took the work by Guo et al. (2018) as a starting point for generating them; however, we have updated many of the semantic actions. For instance, for a couple of actions, we also define their reverse occurrence (e.g., find and find_reverse).
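To make the dictionary-producing and comparison actions of the grammar more concrete, the following is a minimal sketch of `find_tuple_counts`, `greater`, `argmax` and `argmin` over a toy triple store. The triples, the type map and the Python signatures are illustrative assumptions and not the authors' implementation; `argmax` and `argmin` are taken in their usual sense (largest and smallest count).

```python
from collections import Counter

# Toy knowledge graph as (subject, predicate, object) triples (hypothetical data).
TRIPLES = [
    ("france", "borders", "spain"),
    ("france", "borders", "italy"),
    ("france", "borders", "germany"),
    ("spain", "borders", "portugal"),
    ("spain", "borders", "france"),
    ("portugal", "borders", "spain"),
]
# Hypothetical entity-to-type map; every toy entity is a "country".
ENTITY_TYPE = {e: "country" for e in ("france", "spain", "italy", "germany", "portugal")}


def find_tuple_counts(p, tp1, tp2):
    """Map each subject of type tp1 to its number of objects of type tp2 via predicate p."""
    counts = Counter()
    for s, pred, o in TRIPLES:
        if pred == p and ENTITY_TYPE.get(s) == tp1 and ENTITY_TYPE.get(o) == tp2:
            counts[s] += 1
    return dict(counts)


def greater(counts, num):
    """Entities whose count is strictly greater than num."""
    return {e for e, c in counts.items() if c > num}


def argmax(counts):
    """Entities with the largest count (empty set for an empty dictionary)."""
    if not counts:
        return set()
    best = max(counts.values())
    return {e for e, c in counts.items() if c == best}


def argmin(counts):
    """Entities with the smallest count (empty set for an empty dictionary)."""
    if not counts:
        return set()
    worst = min(counts.values())
    return {e for e, c in counts.items() if c == worst}


border_counts = find_tuple_counts("borders", "country", "country")
```

Composing the calls mirrors how a logical form nests actions: `greater(find_tuple_counts(...), 1)` first builds the dictionary and then filters it by count.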

Transformer
To translate the input conversation into a sequence of actions (logical form), we utilise a transformer model (Vaswani et al., 2017). Specifically, the transformer here aims to map a question q, that is a sequence $x = \{x_1, \dots, x_n\}$, to the answer label l, that can also be defined as a sequence $y = \{y_1, \dots, y_m\}$, by modelling the conditional probability $p(y|x)$.
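For an autoregressive decoder such as the one in the transformer of Vaswani et al. (2017), this conditional probability is usually factorised over the output positions. The prefix notation $y_{<t}$ is introduced here for illustration and does not appear in the original text.

```latex
p(y \mid x) = \prod_{t=1}^{m} p\left(y_t \mid y_{<t},\, x\right),
\qquad y_{<t} = \{y_1, \dots, y_{t-1}\}
```

Each factor corresponds to one decoding step, i.e., to the prediction of one action or semantic placeholder of the logical form.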

Input and Word Embedding
Figure 2: The LASAGNE architecture. It consists of three modules: 1) A semantic parsing-based transformer model, containing a contextual encoder and a grammar-guided decoder using the grammar defined in Table 1. 2) An entity recognition module, which identifies all the entities in the context, together with their types, linking them to the knowledge graph. It filters them based on the context and permutes them, in case of more than one required entity. Finally, 3) a graph attention-based module that uses a GAT network initialised with BERT embeddings to incorporate and exploit correlations between (entity) types and predicates. The resulting node embeddings, together with the context hidden state ($h_{ctx}$) and decoder hidden state ($d_h$), are used to score the nodes and predict the corresponding type and predicate.

We have to incorporate the dialog history from previous interactions as an additional input to our model for handling coreference and ellipsis. To do so, we consider the following utterances for each turn: 1) the previous question, 2) the previous answer, and 3) the current question. Utterances are separated from one another by using a [SEP] token. At the end of the last utterance, we append a context token [CTX], which is used as the semantic representation for the entire input question. In the next step, given an utterance q containing n words $\{w_1, \dots, w_n\}$, we first tokenise the conversation context using WordPiece tokenization (Wu et al., 2016), and after that, we use the pre-trained model GloVe (Pennington et al., 2014) to embed the words into a vector representation space of dimension d. Our word embedding model provides us with a sequence $x = \{x_1, \dots, x_n\}$, where $x_i$ is given by $x_i = \mathrm{GloVe}(w_i)$ and $x_i \in \mathbb{R}^d$.
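The construction of the model input from the dialog history can be sketched as follows. This is a minimal illustration only: whitespace splitting stands in for WordPiece tokenisation, the helper name `build_input` is hypothetical, and the example utterances are invented.

```python
SEP = "[SEP]"  # separator between utterances
CTX = "[CTX]"  # context token appended after the last utterance


def build_input(previous_question, previous_answer, current_question):
    """Concatenate the three utterances of a turn into one token sequence.

    Whitespace splitting is a stand-in for the WordPiece tokeniser used in the paper.
    """
    tokens = []
    for position, utterance in enumerate(
        (previous_question, previous_answer, current_question)
    ):
        if position > 0:
            tokens.append(SEP)  # utterances are separated from one another
        tokens.extend(utterance.split())
    tokens.append(CTX)  # its hidden state represents the whole input
    return tokens


example = build_input("Who directed Alien ?", "Ridley Scott", "Where was he born ?")
```

The position of `[CTX]` at the end of the sequence is what later allows its encoder state to be used as the context representation.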

Contextual Encoder
The word embeddings x are forwarded as input to the contextual encoder, which uses the multi-head attention mechanism described by Vaswani et al. (2017). The encoder outputs the contextual embeddings $h^{(enc)} = \{h^{(enc)}_1, \dots, h^{(enc)}_n\}$, where $h^{(enc)}_i \in \mathbb{R}^d$, and it can be defined as:
$$h^{(enc)} = \mathrm{encoder}(x; \theta^{(enc)}),$$
where $\theta^{(enc)}$ are the encoder's trainable parameters.

Grammar-Guided Decoder
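We use a grammar-guided decoder for generating the logical forms. The decoder also employs the multi-head attention mechanism and operates on the contextual embeddings $h^{(enc)}$ produced by the encoder. It can be defined as:
$$h^{(dec)} = \mathrm{decoder}(h^{(enc)}; \theta^{(dec)}), \qquad p^{(dec)}_t = \mathrm{softmax}(W^{(dec)} h^{(dec)}_t),$$
where $h^{(dec)}_t$ is the hidden state in time step t, $\theta^{(dec)}$ are the decoder's trainable parameters, $W^{(dec)} \in \mathbb{R}^{|V^{(dec)}| \times d}$ are the linear layer weights, and $p^{(dec)}_t \in \mathbb{R}^{|V^{(dec)}|}$ is the probability distribution over the decoder vocabulary in time step t. $|V^{(dec)}|$ denotes the decoder's vocabulary size.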

Entity Recognition Module
The entity recognition module is composed of two sub-modules, where each module is trained using a different objective.

Entity Detection and Linking
This sub-module aims to detect the entities and link them to the KG.
Entity Detection: The module is inspired by Shen et al. (2019) and performs type-aware entity detection by using BIO sequence tagging jointly with entity type tagging. Specifically, the entity detection vocabulary is defined as
$$V^{(ed)} = \{O\} \cup \left(\{B, I\} \times \{TP_i\}_{i=1}^{N^{(tp)}}\right),$$
where $TP_i$ denotes the i-th entity type label, $N^{(tp)}$ stands for the number of distinct entity types in the knowledge graph, and $|V^{(ed)}| = 2 \times N^{(tp)} + 1$. For performing the sequence tagging task we use an LSTM (Hochreiter and Schmidhuber, 1997), and the module is defined as:
$$h^{(l)} = \mathrm{LeakyReLU}(\mathrm{LSTM}(h^{(enc)}; \theta^{(l)})), \qquad p^{(ed)}_t = \mathrm{softmax}(W^{(l)} h^{(l)}_t),$$
where $h^{(enc)}$ is the encoder hidden state, $\theta^{(l)}$ are the LSTM layer's trainable parameters, $h^{(l)}_t$ is the LSTM hidden state for time step t, $W^{(l)} \in \mathbb{R}^{|V^{(ed)}| \times d}$ are the linear layer weights, and $p^{(ed)}_t$ is the entity detection module's prediction for time step t. $|V^{(ed)}|$ denotes the entity detection vocabulary size.
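The size of the entity detection vocabulary follows directly from combining the B and I tags with every entity type and adding a single O tag. The sketch below builds such a vocabulary; the label format `B-<type>` / `I-<type>` and the helper name `build_ed_vocab` are assumptions made for illustration.

```python
def build_ed_vocab(entity_types):
    """Build a type-aware BIO tag vocabulary: one O tag plus B-/I- tags per entity type."""
    vocab = ["O"]
    for entity_type in entity_types:
        vocab.append(f"B-{entity_type}")  # first token of an entity of this type
        vocab.append(f"I-{entity_type}")  # continuation token of an entity of this type
    return vocab


small_vocab = build_ed_vocab(["person", "city"])
```

With the 3,054 entity types reported for the knowledge graph used in the experiments, this scheme would give 2 × 3,054 + 1 = 6,109 tags.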
Entity Linking: Once the entity BIO labels and their types are recognised, the next steps for the entity linking are: 1) the BIO labels are used to locate the entity spans from the input utterances. 2) An inverted index built for the knowledge graph entities is used to retrieve candidates for each predicted entity span. Finally, 3) the candidate lists are filtered using the predicted (entity) types. From the filtered candidates, the first entity is considered as correct.
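The three linking steps can be sketched as follows. This is an illustration under assumptions: tags use a `B-<type>` / `I-<type>` / `O` format, the inverted index is a plain dictionary from lower-cased labels to candidate identifiers, and the identifiers `E1` and `E2` are hypothetical.

```python
def extract_spans(tokens, tags):
    """Step 1: locate (surface form, type) spans from type-aware BIO tags."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the span that was open
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open span
        else:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def link_entities(spans, inverted_index, entity_types):
    """Steps 2 and 3: retrieve candidates per span, filter by type, keep the first."""
    linked = []
    for surface, predicted_type in spans:
        candidates = inverted_index.get(surface.lower(), [])
        filtered = [c for c in candidates if predicted_type in entity_types.get(c, set())]
        if filtered:
            linked.append(filtered[0])  # the first filtered candidate is taken as correct
    return linked


tokens = ["Who", "is", "associated", "with", "Jeff", "Smith", "?"]
tags = ["O", "O", "O", "O", "B-common name", "I-common name", "O"]
index = {"jeff smith": ["E1", "E2"]}  # hypothetical candidate identifiers
types = {"E1": {"film"}, "E2": {"common name"}}
spans = extract_spans(tokens, tags)
linked = link_entities(spans, index, types)
```

Taking the first filtered candidate is simple, but it cannot distinguish between candidates that share both label and type.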

Filtering and Permutation
After finding all the entities in the input utterances, we perform two additional tasks in order to use the entities in the generated logical form. First, we filter the relevant entities, and then we permute them into the order required for the logical form. The module receives as input the concatenation of the hidden states of the encoder $h^{(enc)}$ and the hidden states of the LSTM $h^{(l)}$ from the entity detection model. The module learns to assign index tags to each input token. We define the module vocabulary as $V^{(ef)} = \{0, 1, \dots, m\}$, where 0 is the index assigned to the context entities that are not considered. The remaining values are indices that permute our entities based on the logical form. Here, m is the total number of indices, based on the maximum number of entities over all logical forms. Overall, our filtering and permutation module is modelled using a feed-forward network with two linear layers separated by a Leaky ReLU activation function and followed by a softmax. Formally, we define the module as:
$$h^{(ef)}_t = \mathrm{LeakyReLU}(W^{(ef_1)} [h^{(enc)}_t; h^{(l)}_t]), \qquad p^{(ef)}_t = \mathrm{softmax}(W^{(ef_2)} h^{(ef)}_t),$$
where $W^{(ef_1)} \in \mathbb{R}^{d \times 2d}$ are the weights of the first linear layer and $h^{(ef)}_t$ is the hidden state of the module in time step t. $W^{(ef_2)} \in \mathbb{R}^{|V^{(ef)}| \times d}$ are the weights of the second linear layer, $|V^{(ef)}|$ is the size of the vocabulary, and $p^{(ef)}_t$ denotes the probability distribution over the tag indices for time step t.
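Once index tags have been predicted, turning them into an ordered entity list is a small post-processing step. The sketch below assumes that each token position carries either a linked entity or `None`, and that one predicted index per token is available; this alignment, the helper name `order_entities` and the identifiers are illustrative assumptions.

```python
def order_entities(token_entities, token_indices):
    """Filter and permute linked entities using the predicted index tags.

    Index 0 means the entity is not needed; indices 1..m give the slot
    the entity should occupy in the logical form.
    """
    slots = {}
    for entity, index in zip(token_entities, token_indices):
        if entity is not None and index > 0:
            slots.setdefault(index, entity)  # keep the first entity seen for each slot
    return [slots[i] for i in sorted(slots)]


# Hypothetical context with three linked entities; only two are relevant,
# and they are needed in the reverse of their order of appearance.
token_entities = [None, "E_france", None, "E_spain", None, "E_italy"]
token_indices = [0, 2, 0, 0, 0, 1]
ordered = order_entities(token_entities, token_indices)
```

Here `E_spain` is filtered out (index 0), while `E_italy` and `E_france` are placed in the first and second entity slots, respectively.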

Graph Attention-Based Module
A knowledge graph (KG) can be denoted as a set of triples $K \subseteq E \times R \times E$, where E and R are the sets of entities and relations, respectively. To build the (local) graph, we consider the relations and the types of entities that are linked with these relations in the knowledge graph K. We define a graph $G = \{T \cup R, L\}$, where T is the set of types, R is the set of relations, and L is a set of links $(tp_1, r)$ and $(r, tp_2)$ such that $\exists (e_1, r, e_2) \in K$, where $e_1$ is of type $tp_1$ and $e_2$ is of type $tp_2$.
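The definition of G translates into a short construction over the triples. The sketch below is an illustration with hypothetical data and names; it returns only the types and relations that take part in at least one link, and keeps types and relations in separate sets so that identical labels cannot collide.

```python
def build_type_predicate_graph(triples, entity_types):
    """Build the type/relation graph: links (tp1, r) and (r, tp2) for each typed triple."""
    types, relations, links = set(), set(), set()
    for e1, r, e2 in triples:
        for tp1 in entity_types.get(e1, ()):
            for tp2 in entity_types.get(e2, ()):
                types.update((tp1, tp2))
                relations.add(r)
                links.add((tp1, r))  # subject type -> relation
                links.add((r, tp2))  # relation -> object type
    return types, relations, links


# Hypothetical toy knowledge graph.
toy_triples = [("E1", "place of birth", "E2"), ("E3", "place of birth", "E2")]
toy_types = {"E1": {"person"}, "E2": {"city"}, "E3": {"person"}}
T, R, L = build_type_predicate_graph(toy_triples, toy_types)
```

Because relations become nodes rather than edges, a type and a predicate that frequently co-occur end up as direct neighbours, which is what the attention mechanism later exploits.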
To propagate information in the graph and to project prior KG information into the embedding space, we use the Graph Attention Networks (GATs) (Veličković et al., 2018).
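We initialise each node embedding $h^{(g)} = \{h^{(g)}_1, \dots, h^{(g)}_n\}$ using pretrained BERT embeddings, where $n = |T \cup R|$. A GAT layer uses a parameter weight matrix and self-attention to produce a transformation of the input representations, $\bar{h}^{(g)} = \{\bar{h}^{(g)}_1, \dots, \bar{h}^{(g)}_n\}$ with $\bar{h}^{(g)}_i \in \mathbb{R}^d$:
$$\bar{h}^{(g)} = \mathrm{GAT}(h^{(g)}; \theta^{(g)}),$$
where $\theta^{(g)}$ are the trainable parameters. We model the task of predicting the correct type or predicate in the logical form as a classification task over the nodes in graph G, given the current conversational context and decoder hidden state. For each time step t in the decoder, we calculate the probability distribution $p^{(g)}_t$ over the graph nodes as:
$$p^{(g)}_t = \mathrm{softmax}(\bar{h}^{(g)\top} h^{(c)}_t),$$
where $\bar{h}^{(g)} \in \mathbb{R}^{d \times n}$ and $h^{(c)}_t$ is a linear projection of the concatenation of the context representation and the decoder hidden state, given as follows,
$$h^{(c)}_t = W^{(g)} [h_{ctx}; h^{(dec)}_t],$$
and $W^{(g)} \in \mathbb{R}^{d \times 2d}$.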
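The two steps described here, a GAT transformation of the node embeddings followed by scoring the nodes against a context vector, can be sketched in plain Python. This is a single-head illustration under assumptions: neighbour lists include the node itself, `tanh` stands in for the unspecified non-linearity, the LeakyReLU slope of 0.2 follows Veličković et al. (2018), and all weights and inputs are toy values.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(h, neighbours, W, a):
    """Single-head GAT layer: node i becomes sigma(sum_j alpha_ij * W h_j)."""
    Wh = [matvec(W, h_i) for h_i in h]
    out = []
    for i, nbrs in enumerate(neighbours):
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every neighbour j of node i
        e = [leaky_relu(sum(x * y for x, y in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        alpha = softmax(e)  # normalise over the neighbourhood of i
        agg = [sum(al * Wh[j][k] for al, j in zip(alpha, nbrs)) for k in range(len(W))]
        out.append([math.tanh(x) for x in agg])  # tanh chosen here for illustration
    return out


def score_nodes(node_embeddings, h_c):
    """Probability over graph nodes: softmax of dot products with the context vector."""
    return softmax([sum(x * y for x, y in zip(n, h_c)) for n in node_embeddings])


identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = gat_layer([[1.0, 0.0], [0.0, 1.0]], [[0, 1], [0, 1]], identity, [0.0] * 4)
probabilities = score_nodes(nodes, [1.0, -1.0])
```

In the framework itself the context vector would come from the projection of the context and decoder hidden states; here it is simply passed in as a list of floats.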

Learning
The framework consists of four trainable modules: the grammar-guided decoder, entity detection, filtering and permutation, and the GAT-based module for types and predicates. Every module has a loss function that contributes to the overall performance of the framework, as shown in Section 4.3. To account for multi-tasking, we perform a weighted average of all the single losses:
$$L = \lambda_1 L^{dec} + \lambda_2 L^{ed} + \lambda_3 L^{ef} + \lambda_4 L^{g},$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are the relative weights, which are learned during training by taking into account the difference in magnitude between losses by incorporating the log standard deviation (Armitage et al., 2020; Cipolla et al., 2018). $L^{dec}$, $L^{ed}$, $L^{ef}$, and $L^{g}$ are the respective negative log-likelihood losses of the grammar-guided decoder, entity detection, filtering and permutation, and GAT-based modules. These losses are defined as follows:
$$L^{dec} = -\sum_{k=1}^{m} \log p(y^{(dec)}_k \mid x), \qquad L^{ed} = -\sum_{j=1}^{n} \log p(y^{(ed)}_j \mid x),$$
$$L^{ef} = -\sum_{j=1}^{n} \log p(y^{(ef)}_j \mid x), \qquad L^{g} = -\sum_{k=1}^{m} \log p(y^{(g)}_k \mid x),$$
where n and m are the lengths of the input utterance x and the gold logical form, respectively. $y^{(dec)}_k \in V^{(dec)}$ are the gold labels for the decoder, $y^{(ed)}_j \in V^{(ed)}$ are the gold labels for entity detection, $y^{(ef)}_j \in V^{(ef)}$ are the gold labels for filtering and permutation, and $y^{(g)}_k \in \{T \cup R\}$ are the gold labels for the GAT-based module. The model benefits from multiple supervision signals from each module, and this improves the performance in the given task. For more details about GAT, please refer to the appendix.
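One common way to learn such relative weights is the uncertainty-based weighting associated with Cipolla et al. (2018), where each task has a learnable log standard deviation. The sketch below shows this scheme in plain Python; whether LASAGNE uses exactly this parameterisation is an assumption, and the function names are hypothetical.

```python
import math


def nll(step_distributions, gold_labels):
    """Negative log-likelihood of a gold label sequence under per-step distributions."""
    return -sum(math.log(dist[gold]) for dist, gold in zip(step_distributions, gold_labels))


def weighted_multitask_loss(losses, log_sigmas):
    """Combine task losses as sum_i exp(-2 * s_i) * L_i + s_i, with s_i = log(sigma_i).

    exp(-2 * s_i) equals 1 / sigma_i^2, so a task with high uncertainty is
    down-weighted, while the additive s_i term keeps sigma_i from growing freely.
    """
    return sum(math.exp(-2.0 * s) * loss + s for loss, s in zip(losses, log_sigmas))


decoder_loss = nll([{"find": 0.5, "count": 0.5}], ["find"])
total = weighted_multitask_loss([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```

With all log standard deviations at zero the combination reduces to a plain sum of the four losses; during training the values would be updated by the optimiser together with the model parameters.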

Experimental Setup
Datasets: We use the Complex Sequential Question Answering (CSQA) dataset (Saha et al., 2018). CSQA was built on the large-scale knowledge graph Wikidata. Wikidata consists of 21.2M triples with over 12.8M entities, 3,054 entity types, and 567 predicates. The CSQA dataset consists of around 200K dialogues, where the train, validation, and test partitions contain 153K, 16K, and 28K dialogues, respectively. The questions involve complex reasoning to determine the correct answers.
Model Configurations: We incorporate a semi-automated preprocessing step to annotate the CSQA dataset with gold logical forms. For each question type and subtype in the dataset, we create a general template with a pattern sequence that the actions should follow. Thereafter, for each question, we follow a set of rules to create the specific gold logical form, i.e., the gold sequence of actions, based on the question type. The actions used for this process are the ones in Table 1. For all the modules in the LASAGNE framework, we employ an embedding dimension of 300. We use a transformer with two layers and six heads for multi-head attention. For the optimisation, we use the Noam optimiser proposed by Vaswani et al. (2017), which combines the Adam optimiser (Kingma and Ba, 2015) with a number of warmup steps for the learning rate. Please refer to the appendix submitted with the paper for more details.
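The learning rate schedule of Vaswani et al. (2017), commonly called the Noam schedule, increases the rate linearly during warmup and then decays it with the inverse square root of the step. The sketch below uses the model dimension of 300 stated above; the 4,000 warmup steps are the default from Vaswani et al. (2017) and an assumption with respect to this paper, as is the `factor` argument.

```python
def noam_lr(step, d_model=300, warmup_steps=4000, factor=1.0):
    """Learning rate of the Noam schedule at a given optimisation step (step >= 1)."""
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = noam_lr(4000)
```

The rate returned here would be set on the Adam optimiser before every update; the peak is reached exactly at the last warmup step.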
Models for Comparison: We compare the LASAGNE framework with the last three baselines that have been evaluated on the employed dataset. The first baseline is Saha et al. (2018), where the authors introduce the HRED+KVmem model. The second baseline is D2A (Guo et al., 2018), which uses a semantic parsing approach based on a seq2seq model. Finally, the current state of the art is MaSP (Shen et al., 2019), which is also a semantic parsing approach.
Evaluation Metrics: We use the same metrics as employed by the authors of the CSQA dataset (Saha et al., 2018) as well as the previous baselines. The "F1-score" is used for questions that have an answer composed of a set of entities. The "Accuracy" metric is used for the question types whose answer is a number or a boolean value (YES/NO). We also provide an overall score for each evaluation metric and their corresponding question categories.
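For answers that are sets of entities, an F1-score can be computed per question from the overlap between the predicted and the gold set. The sketch below shows this set-based computation; how scores are aggregated across questions in the official evaluation is not covered, and returning 1.0 when both sets are empty is a convention assumed here.

```python
def set_f1(predicted, gold):
    """F1-score between a predicted and a gold set of entities."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumed convention: two empty answers agree
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, golds):
    """Share of exactly matching answers, for count and boolean questions."""
    matches = sum(1 for p, g in zip(predictions, golds) if p == g)
    return matches / len(golds)


example_f1 = set_f1({"a", "b", "c"}, {"b", "c", "d"})
```

In the example, two of three predicted entities are correct and two of three gold entities are found, so precision and recall are both 2/3.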

Results
Table 2 summarises the results comparing the LASAGNE framework against the previous baselines. LASAGNE outperforms the previous baselines on the weighted average over all question types (the row "overall" in Table 2). Furthermore, LASAGNE is the new SotA in 8 out of 10 question types, and in some cases, the improvement is up to 31 percent.
What worked in our case? For question types that require more than two entities for reasoning, such as Logical Reasoning (All) and Verification (Boolean), LASAGNE performs considerably better (+20.79% and +18.23%, respectively). This is mainly due to the proposed entity recognition module. Furthermore, for question types that require two or more (entity) types and predicates, such as Quantitative Reasoning (All), Quantitative Reasoning (Count) and Comparative Reasoning (Count), LASAGNE also outperforms MaSP (+12.92%, +11.79% and +31.08%, respectively). Here, the improvement is due to the graph attention-based module, which is responsible for predicting the relevant (entity) types and predicates. Another interesting result is that LASAGNE also performs better in two out of three Simple Question categories, which involve one entity and one predicate. This performance shows the robustness of LASAGNE.
What did not work in our case? LASAGNE noticeably under-performs on the Clarification question type, where MaSP retains the state of the art. The main reason is the spurious logical forms produced during the annotation process, which have also impacted the Simple Questions (Ellipses) performance.

Ablation Study
Effect of GAT and Multi-task Learning: Table 3 summarises the effectiveness of the GAT-based module and the multi-task learning. We can observe the advantage of using them together in LASAGNE. To show the effectiveness of the GAT-based module, we replace it with two simple classifiers, one for predicates and one for types. We can observe that the performance drops significantly for the question types that require multiple entity types and predicates (e.g., Quantitative Reasoning (All), Quantitative Reasoning (Count) and Comparative Reasoning (Count)). When we exclude the multi-task learning and train all the modules independently, there is a negative impact on all question types. In LASAGNE, the filtering and permutation module, along with the GAT-based module, is heavily dependent on the supervision signals received from the previous modules. Therefore, it is expected that without the multi-task learning, LASAGNE will underperform on all question types, since each module has to re-learn inherited information.
Effect of Filtering and Permutation: To justify the effectiveness of LASAGNE's filtering and permutation module, we compare the overall performance of the entity recognition module to the corresponding module from MaSP. Note that the entity detection modules in both frameworks adopt a similar approach, as defined in Section 3.3. In Table 5 we can see that the MaSP entity recognition module provides an overall accuracy of 79.8% on test data, while our module outperforms it with an accuracy of 92.1%. The main reason for the under-performance of MaSP is that it uses only token embeddings without any entity information. In contrast, our approach avoids re-learning entity information in the question context and relies on the entity detection module's information.

Task Analysis
Table 4 illustrates the task accuracy of LASAGNE. The Entity Detection task has the lowest accuracy (86.75%). The main reason here is the errors in the entity type prediction. For all other tasks, the accuracy is above 90%.

Error Analysis
For the error analysis, we randomly sampled 100 incorrect predictions. We detail the reasons for two types of errors observed in the analysis:
Entity Ambiguity: Even though our entity detection module assigns (entity) types to each predicted span, entity ambiguity remains the biggest challenge for our framework. For instance, for the question "Who is associated with Jeff Smith ?", LASAGNE's entity detection module correctly identifies "Jeff Smith" as an entity surface form and correctly assigns the (entity) type "common name". However, the Wikidata knowledge graph contains more than ten entities with exactly the same label and type. Our entity linking module has difficulties in such cases. Wikidata entity linking is a newly emerging research domain that has its specific challenges, such as entities sharing the same labels, user-created non-standard entity labels, and multi-word entity labels (up to 62 words) (Mulang et al., 2020b). Additional entity contexts, such as entity descriptions and other KG contexts, could help resolve the Wikidata entity ambiguity (Mulang et al., 2020a).
Spurious Logical Form: For specific question categories, we could not identify gold actions for all utterances. Therefore, spurious logical forms are a common source of errors for LASAGNE. Specifically, we have spurious logical forms for the Comparative, Quantitative, and Clarification categories, but we still achieve SotA in the comparative and quantitative categories.

Conclusions
In this article, we focus on complex question answering over a large-scale knowledge graph containing conversational context. We provide a transformer-based framework to handle the task in a multi-task semantic parsing manner. At the same time, we propose a named entity recognition module for entity detection, filtering, and permutation. Furthermore, we also introduce a graph attention-based module, which exploits correlations between (entity) types and predicates for identifying the gold ones for each particular context. We empirically show that our model achieves the best results for numerous question types and also overall. Our ablation study demonstrates the effectiveness of the multi-task learning and of our graph-based module. We also present an error analysis on a random sample of "wrong examples" to discuss our model's weaknesses. For future work, we believe that reinforcement learning is a viable alternative to explore complex conversational question answering without gold annotations.

A Grammar
We propose a new grammar to annotate the dataset with gold logical forms to perform the semantic parsing task. We took the work by Guo et al. (2018) as a starting point for generating the actions, although many of our actions differ in their semantics and therefore in their implementation. Our goal was to define more precise and richer actions, which gives us a more flexible grammar that can be used to annotate questions of a wider range of complexity. For instance, for a couple of actions, we also define their reverse occurrence (e.g., find and find_reverse). We do this in order to match the knowledge graph triple direction (subject-predicate-object). In some questions, we might have the subject or the object entity. Having both normal and reverse actions helps us to directly identify the correct answer based on the action the model predicted. Furthermore, we also define actions that do not exist in Guo et al. (2018). Some of them are find_tuple_counts, atmost, and atleast. Table 1 illustrates the complete grammar with all the defined actions. Following Lu et al. (2008), we define each action with a function that can be executed on the knowledge graph. Finally, in order to execute a sequence of actions, we have to parse it into a tree structure. Our executor starts from the tree leaves and recursively executes the leftmost non-terminal node until the whole tree is complete.
Table 9 shows examples from different question types in the CSQA dataset and the logical forms generated by our model. As we can see, our actions can cover reasoning for every question type by following certain patterns depending on the type. The sequences can cover all the different complexities of the questions. For example, the logical form patterns of Simple Questions differ slightly from those of Quantitative or Comparative questions due to the increased complexity of the latter. The reasoning over Quantitative or Comparative questions involves more actions in order to reach the correct answer.
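The execution of a logical form as a tree can be sketched with nested tuples, where the first element of a tuple names the action and the remaining elements are its arguments. The executor below evaluates arguments from left to right, starting at the leaves, before applying the parent action. The toy triples, the type map and the tuple encoding are illustrative assumptions and do not reproduce the authors' executor.

```python
# Hypothetical toy knowledge graph and entity types.
TRIPLES = {
    ("alice", "place_of_birth", "springfield"),
    ("bob", "place_of_birth", "springfield"),
    ("carol", "place_of_birth", "shelbyville"),
}
ENTITY_TYPES = {"springfield": {"city"}, "shelbyville": {"town"}}


def find(entity, predicate):
    """Objects of the triples with the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == entity and p == predicate}


def find_reverse(entity, predicate):
    """Subjects of the triples with the given object and predicate."""
    return {s for s, p, o in TRIPLES if o == entity and p == predicate}


def filter_type(entities, entity_type):
    """Keep only the entities of the given type."""
    return {e for e in entities if entity_type in ENTITY_TYPES.get(e, set())}


def count(entities):
    return len(entities)


ACTIONS = {"find": find, "find_reverse": find_reverse, "filter_type": filter_type, "count": count}


def execute(node):
    """Recursively execute a logical form tree, leaves first and left to right."""
    if not isinstance(node, tuple):
        return node  # leaf: an entity, predicate, type or value
    action, *arguments = node
    values = [execute(argument) for argument in arguments]
    return ACTIONS[action](*values)


birthplace = execute(("filter_type", ("find", "alice", "place_of_birth"), "city"))
born_there = execute(("count", ("find_reverse", "springfield", "place_of_birth")))
```

The first call mirrors the pattern of a Simple Question (a find nested inside a filter_type), while the second shows how the reverse action answers a question that starts from the object of a triple.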

B Case Study
For the question type Simple Question (Direct), we can see the question "Which administrative territory is the birthplace of Antonio Reguero ?". The correct logical form for this example is "filter_type(find(Antonio Reguero, place of birth), administrative territorial entity)". Here we can distinguish two different actions: the first one is the filter_type action and the other one is the find action. The find action receives as input a subject entity and a predicate and provides the set of object entities from the knowledge graph. The filter_type action, in turn, receives as input a set of entities along with an entity type and returns the set of entities that belong to that particular entity type.

[...] (Vaswani et al., 2017). Our model dimension is $d_{model} = 300$, with a total number of H = 6 heads and L = 2 layers. The inner feed-forward linear layers have dimension $d_{ff} = 600$. Following the base transformer parameters, we apply residual dropout (Srivastava et al., 2014) to the summation of the embeddings and the positional encodings in both encoder and decoder stacks with a rate of 0.1. The entity detection module has a dimension of 300. Our base LSTM here is followed by a LeakyReLU, dropout, and a linear layer. The output of the linear layer is the module prediction, while the LSTM hidden state is propagated to the filtering and permutation layer. The filtering and permutation module receives an input of dimension 600, which a linear layer reduces to 300, the framework dimension. As in the previous module, a LeakyReLU, dropout, and a linear layer are used for the final predictions. Finally, for the GAT-based module, we use pre-trained BERT embeddings for type and predicate labels. Hence the input dimension of this module is 3072. The GAT layer produces representations with an embedding size of 300. Next, multiple linear, dropout, and LeakyReLU layers are used to produce the final predictions.
D Graph Attention Networks
Figure 3 shows the aggregation process of the graph attention layer between the (entity) types and predicates from Wikidata. The KB types and predicates are the nodes of the graph, and there exists an edge between a type and a predicate only if there exists a triple which involves the predicate and an entity of that type. We use GATs (Veličković et al., 2018) to capture different levels of information for a node, based on the neighborhood in the graph. We denote with $h^{(g)} = \{h^{(g)}_1, \dots, h^{(g)}_n\}$ the initial representations of the nodes, which will also be the input features for the GAT layer. To denote the influence of node j on node i, an attention score $e_{ij}$ is computed as $e_{ij} = a(W h^{(g)}_i, W h^{(g)}_j)$, where W is a parameterized linear transformation, and a is an attention function. In our case, we follow the GAT paper and compute the $e_{ij}$ score as follows,
$$e_{ij} = \mathrm{LeakyReLU}(a^{\top} [W h^{(g)}_i \,||\, W h^{(g)}_j]),$$
where $a \in \mathbb{R}^{2d}$ is a single-layer feedforward network, and || denotes concatenation. These attention scores are normalized using a softmax function, producing the $\alpha_{ij}$ scores for all the edges in a neighborhood,
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})},$$
where $N_i$ denotes the neighborhood of node i. These normalized attention scores are used to compute the output features $\bar{h}^{(g)}_i$ of a node in the graph, by applying a linear combination of all the nodes in the neighborhood as below,
$$\bar{h}^{(g)}_i = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W h^{(g)}_j\Big),$$
where σ is a non-linear function. Following Veličković et al. (2018) and Vaswani et al. (2017), we also apply a multi-head attention mechanism and compute the final output features as,
$$\bar{h}^{(g)}_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha^{k}_{ij} W^{k} h^{(g)}_j\Big),$$
where K is equal to the number of heads, and $\alpha^{k}_{ij}$, $W^{k}$ are the corresponding attention scores and linear transformation of the k-th attention mechanism. During our experiments, we found that K = 2 was sufficient for our model.

Figure 3: The aggregation process of the graph attention layer between the (entity) types and predicates from the Wikidata knowledge graph. The dashed lines represent an auxiliary edge, while $\alpha_{ij}$ represents the relative attention values of the edge. We also incorporate the predicates (relations) as nodes of the graph instead of edges.

E Experiments
Table 7 summarizes precision and recall results comparing the LASAGNE framework against the previous baselines. Furthermore, a detailed task analysis for each task on each question type is illustrated in Table 8.

Table 7: Precision and recall comparison with baselines.

Table 9: Examples from the CSQA dataset (Saha et al., 2018) annotated with gold logical form.
Q1: How many alphabets are used as the scripts for more number of languages than Jawi alphabet ?
count(greater(count(filter_type(find(Jawi alphabet, writing system), language)), find_tuple_counts(writing system, alphabet, language)))
Comparative Reasoning (All)
Q1: Which occupations were more number of publications and works mainly about than composer ?
greater(union(find_reverse_tuple_counts(main subject, occupation, publication), find_reverse_tuple_counts(main subject, occupation, work)), count(filter_multi_types(find_reverse(composer, main subject), publication, work)))
Verification
Q1: Was Geir Rasmussen born at that administrative territory ?
is_in(find(Geir Rasmussen, place of birth), Chicago)