Expanding End-to-End Question Answering on Differentiable Knowledge Graphs with Intersection

End-to-end question answering using a differentiable knowledge graph is a promising technique that requires only weak supervision, produces interpretable results, and is fully differentiable. Previous implementations of this technique (Cohen et al., 2020) have focused on single-entity questions using a relation following operation. In this paper, we propose a model that explicitly handles multiple-entity questions by implementing a new intersection operation, which identifies the shared elements between two sets of entities. We find that introducing intersection improves performance over a baseline model on two datasets, WebQuestionsSP (69.6% to 73.3% Hits@1) and ComplexWebQuestions (39.8% to 48.7% Hits@1), and in particular, improves performance on questions with multiple entities by over 14% on WebQuestionsSP and by 19% on ComplexWebQuestions.


Introduction
Knowledge graphs (KGs) are data structures that store facts in the form of relations between entities. Knowledge Graph-based Question Answering (KGQA) is the task of learning to answer questions by traversing facts in a knowledge graph. Traditional approaches to KGQA use semantic parsing to parse natural language to a logical query, such as SQL. Annotating these queries, however, can be expensive and require experts familiar with the query language and KG ontology.
End-to-end question answering (E2EQA) models overcome this annotation bottleneck by requiring only weak supervision from question-answer pairs. These models learn to predict paths in a knowledge graph using only the answer as the training signal. In order to train an E2EQA model in a fully differentiable way, Cohen et al. (2020) proposed differentiable knowledge graphs as a way to represent KGs as tensors and queries as differentiable mathematical operations. Previous implementations of E2EQA models using differentiable knowledge graphs have focused on single-entity questions using a relation following operation. For example, to answer "Where was Natalie Portman born?", the model could predict a path starting at the Natalie Portman entity and following a place of birth relation to the correct answer.
While this follow operation handles many questions, it often struggles on questions with multiple entities. For example, to answer "Who did Natalie Portman play in Star Wars Episode II?", it is not enough to identify all the characters Natalie Portman has played, nor all the characters in Star Wars Episode II. Instead, the model needs to find what character Natalie Portman has played that is also a character in Star Wars. This can be solved through intersection. An intersection of two sets A and B returns all elements in A that also appear in B. This example is illustrated in Figure 1.
In this paper, we propose to explicitly handle multiple-entity questions in E2EQA by learning intersection in a dynamic multi-hop setting. Our intersection models learn to both follow relations and intersect sets of resulting entities in order to arrive at the correct answer. We find that our models score 73.3% on WebQuestionsSP and 48.7% on ComplexWebQuestions, and in particular, improve upon a baseline on questions with multiple entities from 56.3% to 70.6% on WebQuestionsSP and from 36.8% to 55.8% on ComplexWebQuestions.

Related Works
Traditional approaches to KGQA have used semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005) to parse natural language into a logical form. Collecting semantic parsing training data can be expensive and is done either manually (Dahl et al., 1994; Finegan-Dollak et al., 2018) or using automatic generation (Wang et al., 2015), which is not always representative of natural questions (Herzig and Berant, 2019).
Another line of work in KGQA uses embedding techniques to implicitly infer answers from knowledge graphs. These methods include GRAFT-Net (Sun et al., 2018), which uses a graph convolutional network to infer answers from subgraphs, PullNet (Sun et al., 2019), which improves GRAFT-Net by learning to retrieve subgraphs, and EmbedKGQA (Saxena et al., 2020), which incorporates knowledge graph embeddings. EmQL is a query embedding method using set operators; however, these operators must be pretrained for each KB. TransferNet (Shi et al., 2021) is a recent model that trains KGQA in a differentiable way; however, it stores facts as an N × N matrix, where N is the number of entities, so it runs into scaling issues with larger knowledge graphs.
Our approach to KGQA, based on differentiable knowledge graphs, has three main advantages:
• Interpretability: Models based on graph convolutional networks (PullNet, GRAFT-Net) achieve good performance but have weak interpretability because they do not output intermediate reasoning paths. Our approach outputs intermediate paths as well as their probabilities.
• Scaling: Differentiable KGs can be distributed across multiple GPUs and scaled horizontally, so that different triple IDs are stored on different GPUs, allowing scaling to tens of millions of facts. Other methods using embedding techniques (EmbedKGQA, EmQL) or less efficient representations (TransferNet) are more memory intensive and not easily distributed.
• No retraining for new entities: Models based on latent representations of entities (EmbedKGQA, EmQL) achieve state-of-the-art performance; however, they must be retrained whenever a new entity is added to the KG (e.g., a new movie) in order to learn updated embeddings. Our approach can incorporate new entities easily without affecting trained models.

The Baseline Model
Our baseline model, which we call Rigel-Baseline, is based on differentiable knowledge graphs and the ReifiedKB model (Cohen et al., 2020). We provide an overview here; full details can be found in the original paper.

Differentiable Knowledge Graphs
Assume we have a knowledge graph

G = {(s, p, o)} ⊆ E × R × E    (1)

where E is the set of entities, R is the set of relations, and a triple (s, p, o) indicates that the relation p holds between a subject entity s and an object entity o. To create a differentiable knowledge graph, we represent the set of all triples in three matrices: a subject matrix (M_s), relation matrix (M_p), and object matrix (M_o). A triple (s, p, o) is represented across all three matrices at a given index.
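As an illustration, the three matrices can be built from a toy triple store like this (a minimal sketch with invented example triples, not the paper's code; each row of the three matrices encodes one triple's subject, relation, and object as one-hot vectors):

```python
import numpy as np

entities = ["NataliePortman", "Jerusalem", "PadmeAmidala"]
relations = ["place_of_birth", "played_character"]
triples = [
    ("NataliePortman", "place_of_birth", "Jerusalem"),
    ("NataliePortman", "played_character", "PadmeAmidala"),
]

e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: i for i, r in enumerate(relations)}

n_t, n_e, n_r = len(triples), len(entities), len(relations)
M_s = np.zeros((n_t, n_e))  # subject matrix: row i marks triple i's subject
M_p = np.zeros((n_t, n_r))  # relation matrix: row i marks triple i's relation
M_o = np.zeros((n_t, n_e))  # object matrix: row i marks triple i's object
for i, (s, p, o) in enumerate(triples):
    M_s[i, e_idx[s]] = 1.0
    M_p[i, r_idx[p]] = 1.0
    M_o[i, e_idx[o]] = 1.0
```

In practice these matrices are extremely sparse, which is what makes storing tens of millions of facts feasible.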
Since the knowledge graph is represented as matrices, interacting with the knowledge graph is done with matrix operations. ReifiedKB was implemented with a follow operation: given an entity vector x_{t−1} ∈ R^{N_E} at time step t−1 and a relation vector r_t ∈ R^{N_R}, the resulting entity vector x_t is computed by Equation 2:

x_t = M_o^⊤ (M_s x_{t−1} ⊙ M_p r_t)    (2)

where ⊙ is element-wise multiplication.
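The follow operation of Equation 2 can be sketched directly in NumPy (a toy three-entity, two-relation KG; the matrices and names are invented for illustration):

```python
import numpy as np

# Toy KG: 2 triples over 3 entities and 2 relations.
# Triple 0: (entity 0, relation 0, entity 1); triple 1: (entity 0, relation 1, entity 2).
M_s = np.array([[1., 0., 0.], [1., 0., 0.]])  # subjects
M_p = np.array([[1., 0.], [0., 1.]])          # relations
M_o = np.array([[0., 1., 0.], [0., 0., 1.]])  # objects

def follow(x_prev, r):
    # Score each triple by how much its subject appears in x_prev and its
    # relation in r, then project the triple scores onto object entities.
    return M_o.T @ (M_s @ x_prev * M_p @ r)

x0 = np.array([1., 0., 0.])  # start at entity 0 (e.g., Natalie Portman)
r = np.array([1., 0.])       # follow relation 0 (e.g., place_of_birth)
x1 = follow(x0, r)           # mass moves to entity 1
```

Because every step is a matrix product or element-wise product, gradients flow through the whole query, which is what makes training from answer-only supervision possible.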

Model
The Rigel-Baseline model is composed of an encoder, which encodes the question, and a decoder, which returns a probability distribution over KG relations. The question entities (which, in our experiments, are provided from the datasets) and predicted relations are followed in the differentiable knowledge graph to return predicted answers. Predicted answers are compared to labeled answers, and the loss is used to update the model. Rigel-Baseline is illustrated in Figure 2.
We make the following key improvements to ReifiedKB. First, we use RoBERTa (Liu et al., 2019) as our encoder instead of word2vec. Second, ReifiedKB used different, dataset-specific methods to determine the correct number of hops in multi-hop questions. We instead implement a learned attention mechanism using a hierarchical decoder W_dec^t, giving a unified approach across datasets. Given a question embedding h_q, the relation vector r_t = softmax(W_dec^t h_q) is applied with the follow operation of Equation 2 to produce the entity vector x_t for each hop t. We compute an attention score α_t across all hops with a softmax over learned per-hop scores, and compute the final estimate ŷ as the attention-weighted sum:

ŷ = Σ_t α_t x_t

Finally, while ReifiedKB used cross-entropy as its loss function, we instead use a multi-label loss function across all entities. This is because the output space in many samples contains multiple entities, so cross-entropy loss is inadequate.

The Intersection Model
In order to build a model that can handle multiple entities, we expand Rigel-Baseline with a differentiable intersection operation to create Rigel-Intersect. We define intersection as the element-wise minimum of two vectors. While differentiable intersection has previously been implemented as element-wise multiplication, we prefer minimum since it prevents diminishing probabilities. Given two vectors a and b, the element-wise minimum (min_elem) returns min(a_n, b_n) at each index n, i.e., a_n if a_n < b_n and b_n otherwise. Any element that appears in both vectors returns a non-zero value, and elements that appear in only one vector, or in neither, return 0.
min_elem(a, b) = [min(a_1, b_1), min(a_2, b_2), ..., min(a_N, b_N)]    (9)

Next, we modify the encoder to allow entity-specific question embeddings. The Rigel-Baseline encoder creates one generic question embedding per question, but we may want to follow different relations for each entity. To calculate the question embedding h_q using an encoder f_q, for each question entity we concatenate the question text q with the entity's mention or canonical name m, separated by a separator token [SEP]. We use the embedding at the separator token index i_SEP as the entity-specific representation of the question.
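The min-based intersection of Equation 9, and the diminishing-probability behavior of multiplication it avoids, can be seen on illustrative values (the vectors below are invented, not model outputs):

```python
import numpy as np

def intersect(a, b):
    # Element-wise minimum: an entity survives only if it has mass in both sets.
    return np.minimum(a, b)

a = np.array([0.9, 0.8, 0.0])  # e.g., soft set of characters played by Natalie Portman
b = np.array([0.9, 0.0, 0.7])  # e.g., soft set of characters in Star Wars Episode II

min_result = intersect(a, b)   # shared element keeps its 0.9 score
prod_result = a * b            # multiplication shrinks it to 0.81
```

Under repeated intersections, multiplication compounds this shrinkage, while the minimum keeps scores on the same scale as the inputs.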
In the decoder, we predict inference chains in parallel for each entity, and follow the entities and relations in the differentiable KG to return intermediate answers. We intersect the two intermediate answers to return the final answer. In multi-hop settings, we weight entities in each vector before intersection based on the attention score. We train using the hyperparameters in Appendix A. Rigel-Intersect is illustrated in Figure 3.
Our implementation of intersection takes a maximum of two entities per question. We use the first two labeled entities per question and ignore subsequent ones. This works well for our datasets where 94% of questions contain a maximum of two entities. Given a dataset with more complex questions, we can extend our implementation in the future to an arbitrary number of entities.

Datasets
We use two datasets in our experiments: WebQuestionsSP (Yih et al., 2016) is an English question-answering dataset of 4,737 questions (2,792 train, 306 dev, 1,639 test) that are answerable using Freebase (Bollacker et al., 2008). During training, we exclude 30 training set questions with no answer, and during evaluation, we exclude 13 test set questions with no answer and count them as failures in our Hits@1 score. All questions are answerable by 1- or 2-hop chains of inference, so we set our models to a maximum of 2 hops. To create a subset of Freebase, we identify all question entities and relations in WebQuestionsSP and build a subgraph containing all facts reachable within 2 hops of the question entities, as done in previous works. This creates a subgraph of 17.8 million facts, 9.9 million entities, and 670 relations. We create inverse relations for each relation (e.g., place of birth returns a person's birthplace; inv-place of birth returns people born in that location), for a total of 1,340 relations.
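The subgraph construction described above can be sketched as a simple reachability pass (toy triples and a hypothetical helper `two_hop_subgraph`, not the paper's Freebase pipeline):

```python
def two_hop_subgraph(triples, question_entities, hops=2):
    """Keep facts reachable within `hops` steps of the question entities,
    then add an inverse fact for each kept relation."""
    reached = set(question_entities)
    kept = set()
    for _ in range(hops):
        new_entities = set()
        for (s, p, o) in triples:
            if s in reached:        # fact reachable from the current frontier
                kept.add((s, p, o))
                new_entities.add(o)
        reached |= new_entities
    # Inverse relations let chains traverse facts in either direction.
    kept |= {(o, "inv-" + p, s) for (s, p, o) in kept}
    return kept

triples = [
    ("NataliePortman", "place_of_birth", "Jerusalem"),
    ("Jerusalem", "part_of", "Israel"),
    ("Paris", "part_of", "France"),  # unreachable, should be dropped
]
sub = two_hop_subgraph(triples, {"NataliePortman"})
```

Restricting the differentiable KG to this reachable subgraph is what keeps the matrices small enough to fit in GPU memory.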
ComplexWebQuestions (Talmor and Berant, 2018) is an extended version of WebQuestionsSP with 34,689 questions (27,649 train, 3,509 dev, 3,531 test) in English requiring complex reasoning. During training, we exclude 163 questions that are missing question or answer entities, and during evaluation, we exclude 21 examples from the test set for the same reasons and count them as failures in the Hits@1 score. We limit our model to 2 hops for training efficiency. To create a subset of Freebase, we identify all question entities and relations in the dataset and build a subgraph containing all facts reachable within 2 hops of the question entities. This results in a subgraph of 43.2 million facts, 17.5 million entities, and 848 relations (1,696 including inverse relations).

Results

Model                            WQSP   CWQ
KVMem (Miller et al., 2016)      46.7   21.1
GRAFT-Net (Sun et al., 2018)     67.8   32.8
PullNet (Sun et al., 2019)       68.1   47.2
ReifiedKB                        52.7   -
EmQL                             75.5   -
TransferNet (Shi et al., 2021)   71.4   48.6
Rigel-Baseline (ours)            69.6   39.8
Rigel-Intersect (ours)           73.3   48.7

Our results are in Tables 1 and 2. Scores are reported as Hits@1, which is the accuracy of the top predicted answer from the model. Table 1 compares our scores to previous models. The aim of our paper is to show an extension of a promising KGQA technique, not to produce state-of-the-art results, but this table shows that Rigel-Intersect is competitive with recent models. Our improved Rigel-Baseline scores higher than ReifiedKB on WebQuestionsSP at 69.6%, and Rigel-Intersect improves upon that at 73.3%. On ComplexWebQuestions, Rigel-Baseline scores lower than recent methods at 39.8%, but Rigel-Intersect gets competitive results with 48.7%. The breakdown of results in Table 2 shows that the improved performance of Rigel-Intersect comes from better handling of questions with multiple entities. While Rigel-Baseline and Rigel-Intersect are comparable on questions with one entity, Rigel-Intersect surpasses Rigel-Baseline on questions with more than one entity by over 14% on WebQuestionsSP (56.3% vs. 70.6%) and by 19% on ComplexWebQuestions (36.8% vs. 55.8%). Example model outputs are in Appendix C.
Rigel-Baseline is not entirely incapable of handling multiple-entity questions, because not all such questions require intersection. For example, in "Who played Jacob Black in Twilight?", the model can follow Jacob Black to the actor, Taylor Lautner, without intersecting with Twilight, because only one actor has played Jacob Black. This would not work for characters such as James Bond or Batman, who are portrayed by different actors in different movies. Although Rigel-Baseline can spuriously answer some multiple-entity questions, Rigel-Intersect uses more accurate inference chains.

Conclusions
In this paper, we expand an end-to-end question answering model using differentiable knowledge graphs to learn an intersection operation. We show that introducing intersection improves performance on WebQuestionsSP and ComplexWebQuestions. This improvement comes primarily from better handling of questions with multiple entities, which improves by over 14% on WebQuestionsSP, and by 19% on ComplexWebQuestions. In future work, we plan to expand our model to more operations, such as union or difference, to continue improving model performance on complex questions.

A Model Hyperparameters
We train Rigel-Baseline and Rigel-Intersect using the hyperparameters shown below. For WebQuestionsSP, we train on a single 16GB GPU. Training completes in approximately 12 hours. ComplexWebQuestions is a larger dataset with a larger knowledge graph, so we train on four 16GB GPUs, with the knowledge graph distributed across three GPUs and the model on the fourth GPU.

B Validation and Test Performance
The table below shows validation and test performance for both models on each dataset.

C Examples
The table on the following page shows example outputs of Rigel-Baseline and Rigel-Intersect. We only show the top predicted inference chain for each question, but in practice, a probability distribution over all relations is returned. The model also predicts how many hops to take based on an attention score. In our examples, if the model predicts one hop, we show only the top relation from the first hop. If the model predicts two hops, then we show the top relations for both hops. The first two examples are questions that Rigel-Intersect handles well. In both of these questions, there are multiple probable answers if only one entity is followed (i.e., Russell Wilson attended multiple educational institutions; Michael Keaton has played multiple roles). However only one answer is correct if all entities are considered. The final question is an example where Rigel-Baseline answers correctly even though there are two entities. This is because there is only one winner of the 2014 Eurocup Finals Championship, which Rigel-Baseline can identify without needing to check if the team is from Spain.