Improving Query Graph Generation for Complex Question Answering over Knowledge Base

Most existing Knowledge-based Question Answering (KBQA) methods first learn to map the given question to a query graph, and then convert the graph to an executable query to find the answer. The query graph is typically expanded progressively from the topic entity based on a sequence prediction model. In this paper, we propose a new solution to query graph generation that works in the opposite manner: we start with the entire knowledge base and gradually shrink it to the desired query graph. This approach improves both the efficiency and the accuracy of query graph generation, especially for complex multi-hop questions. Experimental results show that our method achieves state-of-the-art performance on the ComplexWebQuestion (CWQ) dataset.


Introduction
Knowledge-based question answering (KBQA) is the task of finding answers to questions by processing a structured knowledge base (KB). A KB graph consists of general facts which are organized as entity-relation-entity triplets, with entities as vertices and relations as edges. To answer a simple question such as "Who is the president of the United States?", a typical KBQA system first identifies the entity (i.e., "United States") and the relation (i.e., "president") asked in the question, and then searches for the answer entity by matching the triplet fact query <United States, president, ?> over the KB (Bordes et al., 2015; Yin et al., 2016; Yu et al., 2017; Zhang et al., 2018; Zhao et al., 2019). To answer a multi-hop question, multiple facts are extracted to form a structured representation, namely, a query graph (Yih et al., 2015). For example, the question "What was the name of the publisher for Disney Channel Magazine's first cartoon?" corresponds to a query graph that consists of three facts with grounded (i.e., topic entity) and ungrounded entities (i.e., "?"): <?, publisher, Disney Channel Magazine>, <?, cartoon, ?>, <?, published date, ?>, and a constraint: order by. The query graph can be converted to an executable query to find the answer in the KB. Generating the query graph accurately and efficiently is the key challenge in KBQA.
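As a minimal illustration of matching a single triplet fact query such as <United States, president, ?>, the sketch below scans a toy in-memory KB. The KB contents, the placeholder entity names, and the helper `match_fact` are assumptions made for illustration only; a real system would issue the query against Freebase.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy KB with placeholder entities; not taken from Freebase.
TOY_KB: List[Triple] = [
    ("United States", "president", "person_A"),
    ("United States", "capital", "city_B"),
    ("France", "president", "person_C"),
]


def match_fact(kb: List[Triple], head: Optional[str], relation: Optional[str],
               tail: Optional[str]) -> List[Triple]:
    """Return every triple that agrees with the pattern; None plays the role of '?'."""
    return [
        (h, r, t) for (h, r, t) in kb
        if head in (None, h) and relation in (None, r) and tail in (None, t)
    ]


# The single-hop query <United States, president, ?>
answers = [t for (_, _, t) in match_fact(TOY_KB, "United States", "president", None)]
print(answers)
```

A multi-hop question would chain several such patterns through shared ungrounded entities, which is what the query graph encodes.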
While single-hop questions are easy to answer by searching for a single fact in the KB, multi-hop questions are much harder to answer because the search space grows exponentially as the number of hops increases. The most common way to solve a multi-hop question is to first generate candidate query graphs and then validate and rank them down to one. Previous works construct candidate query graphs by starting with the topic entity and progressively expanding the graph (Bao et al., 2016; Liang et al., 2017; Zhou et al., 2018; Luo et al., 2018; Chen et al., 2019). They greedily determine which relation best fits the current incomplete query graph at each step, but fail to capture global properties of the complete query graph. At each step as the query graph grows, these step-wise models need to query the KB, measure semantic relevance, and update the model, which inevitably leads to high computational cost. For example, it takes more than two weeks to train the state-of-the-art model QGG (Lan and Jiang, 2020) on the ComplexWebQuestion dataset (Talmor and Berant, 2018). To reduce the computation, these methods limit the maximum length of the search path to a small number and use beam search to maintain only the top candidates at each step, which causes some good candidates to be missed.
We instead propose a novel query graph generation method that works in quite the opposite manner: we start with the entire KB and gradually shrink it to the desired query graph. In the candidate query graph generation stage, in contrast to existing works that use expensive semantic relevance features, we rely only on cheap global features that capture syntactic matches with the query or structure matches with the KB. This allows us to quickly filter out a large number of low-quality candidate graphs, and run the computation-intensive ranking stage on a relatively small number of promising graphs. Compared to previous approaches, ours is computationally efficient and can explore, in early stages, candidate graphs missed by previous methods. Experimental results show that our method delivers consistent performance on two KBQA datasets: it improves the state-of-the-art results by an absolute 5.8% in F1 on the multi-hop KBQA task CWQ and produces a competitive result on the single-hop / two-hop KBQA task WQSP. In contrast, while some baseline methods work somewhat better on simple single-hop / two-hop questions, their performance drops dramatically on complex multi-hop questions.
Figure 1: Different stages of the proposed method on an example question with topic entity "Swiss Psalm". Rectangles represent relations and circles represent entities. The graph transformation ends in the answer; the index above each arrow denotes the subfigure (stage) in which the transformation happens.

Methodology
First we introduce some definitions and notations. For a given knowledge base KB, its associated relation graph $RG$ is the undirected graph whose vertices correspond to the edges in KB, and whose edges correspond to the edge adjacencies in KB (two nodes in $RG$ are connected if the corresponding two edges in KB share a common entity node). A query graph $G^q$ is a subgraph of KB, which reveals the semantic structure and the topic entities of the input question $q$. One can translate a query graph into an executable SPARQL query and execute it against the KB to obtain the answer entities. A typical query graph consists of four types of nodes (Yih et al., 2015): the grounded entity corresponding to $e_{topic}$; existential variables $e_{ung}$, which are ungrounded entities; the lambda variable $e_{ans}$, which is an ungrounded entity representing the answer; and aggregation function nodes. The latter are commonly defined for $G^q$ and represent set operations (e.g., argmax and order by) on $e_{ung}$ or $e_{ans}$.
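The definition of the relation graph can be sketched in a few lines. The version below is a minimal sketch that assumes KB edges are collapsed by relation label, so that each relation appears once as a node (consistent with the later remark that relations serve as nodes); the function name `build_relation_graph` and the toy triples are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_relation_graph(triples):
    """Map KB triples (head, relation, tail) to an undirected relation graph.

    Returns {relation: set of adjacent relations}. Two relations are adjacent
    when a KB edge of one shares an entity node with a KB edge of the other.
    """
    relations_at = defaultdict(set)  # entity -> relations touching that entity
    for head, rel, tail in triples:
        relations_at[head].add(rel)
        relations_at[tail].add(rel)

    graph = defaultdict(set)
    for rel_set in relations_at.values():
        for rel in rel_set:
            graph[rel]  # touch the key so isolated relations still become nodes
        for a, b in combinations(sorted(rel_set), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Toy KB: r_anthem and r_country share entity e1; r_language is disconnected.
toy_kb = [
    ("e_topic", "r_anthem", "e1"),
    ("e1", "r_country", "e2"),
    ("e3", "r_language", "e4"),
]
rg = build_relation_graph(toy_kb)
print(rg)
```

Collapsing by label keeps the graph small, which matches the motivation for working with relations rather than entities.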
To find the answer to a question $q$, our method shrinks KB down to $G^q$ in three stages: the relation subgraph extractor predicts a relation subgraph $RG^q \subset RG$ for an input question $q$; the query graph generator generates a set of candidate query graphs $G^q_j \subset KB$, $j = 1, 2, \cdots$; and the query graph ranker ranks the $G^q_j$ and selects the top one to produce the answer.
Relation Subgraph Extractor: we first extract a relation subgraph $RG^q \subset RG$ for an input question $q$. $RG^q$ only captures relations relevant to $q$ (see a running example in Figure 1a). The reason for considering a relation graph (with relations as nodes) instead of an entity graph (with entities as nodes) is that we want the graph to be small, and in a typical KB the number of relations is much smaller than the number of entities. We identify the relations in $RG^q$ for the given question with a multi-label classifier. Specifically, we consider all annotated relations in the training data as potential labels and train binary relevance (Tsoumakas and Katakis, 2007; Wang et al., 2018) with logistic regression to determine the relevance of each relation. Unigrams from the questions are used as features. Because the given questions are very short, both extracting unigrams and running logistic regression on sparse features are very fast. Next, we extract a subgraph $RG^q \subset RG$ whose nodes correspond to the predicted relevant relations. In effect, we narrow down the search space from the entire KB (38M entities) to a small relation graph of about 10-20 relations.
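To make the binary relevance idea concrete, the sketch below trains one independent logistic regression per relation label on unigram features, using plain stochastic gradient descent. It is an illustrative toy under stated assumptions, not the training setup used in the experiments: the tokenizer, the learning rate, the number of epochs, and the example questions are all hypothetical.

```python
import math
from collections import defaultdict


def unigrams(question):
    """Lower-cased whitespace tokens, used as sparse binary features."""
    return set(question.lower().replace("?", "").split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary_relevance(examples, epochs=200, lr=0.5):
    """Train one logistic regression per relation (binary relevance).

    examples: list of (question, set of relevant relations).
    Returns {relation: (weights by unigram, bias)}.
    """
    labels = sorted({rel for _, rels in examples for rel in rels})
    models = {}
    for label in labels:
        weights, bias = defaultdict(float), 0.0
        for _ in range(epochs):
            for question, rels in examples:
                feats = unigrams(question)
                pred = sigmoid(bias + sum(weights[f] for f in feats))
                grad = pred - (1.0 if label in rels else 0.0)
                bias -= lr * grad
                for f in feats:
                    weights[f] -= lr * grad
        models[label] = (dict(weights), bias)
    return models


def relevance_scores(models, question):
    """Score every relation independently for a new question."""
    feats = unigrams(question)
    return {
        label: sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))
        for label, (weights, bias) in models.items()
    }


toy_examples = [
    ("who is the president of france", {"president"}),
    ("what is the capital of france", {"capital"}),
    ("who was the president of brazil", {"president"}),
    ("what is the capital of brazil", {"capital"}),
]
models = train_binary_relevance(toy_examples)
scores = relevance_scores(models, "who is the president of peru")
print(scores)
```

Relations whose score exceeds a chosen cut-off would then become the nodes of the relation subgraph, and the scores themselves can be reused as features in the next stage.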
Query Graph Generator: we aim to get a set of candidate query graphs $G^q_j \subset KB$, $j = 1, 2, \cdots$. We start with the relation subgraph $RG^q \subset RG$ from the previous stage, and further narrow down the search space by selecting some of its high-quality subgraphs $RG^q_i \subset RG^q$, $i = 1, 2, \cdots$, which represent relevant relation query structures. We propose to find the $RG^q_i$ that lead to answers with high F1 scores using a logistic regression model. In practice, we consider all subgraphs of $RG^q$ with up to 5 nodes as candidates (see examples of the top three $RG^q_i$ in Figure 1b). Four types of features are used to characterize each candidate subgraph $RG^q_i$:
(1) Shape of the graph (categorical): two graphs have the same shape if they are isomorphic; 15 different shapes are identified.
(2) Relation relevance score (numeric): for relations in $RG^q_i$, we use their relevance scores generated in the previous stage. For relations not in $RG^q_i$, the scores are set to 0.
(3) Relation pairs (binary): the presence of each adjacent relation pair is used as a binary feature.
(4) (Relation, entity type, relation) triplets (binary): for each adjacent relation pair, we further combine it with the most common entity type of the connected node.
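Candidate enumeration and the relation-pair feature can be sketched as follows. This is a minimal sketch that assumes only connected subgraphs are of interest (the text does not state this explicitly) and that the relation subgraph is given as an adjacency dictionary; the function names and the toy graph are hypothetical.

```python
def connected_subgraphs(graph, max_nodes=5):
    """Enumerate node sets of connected induced subgraphs with at most max_nodes nodes.

    graph: {node: set of neighbouring nodes} for an undirected graph.
    """
    found = set()
    frontier = {frozenset([node]) for node in graph}
    while frontier:
        found |= frontier
        next_frontier = set()
        for nodes in frontier:
            if len(nodes) == max_nodes:
                continue
            # Grow the subgraph by one neighbouring node at a time.
            neighbours = set().union(*(graph[n] for n in nodes)) - nodes
            for nb in neighbours:
                next_frontier.add(nodes | {nb})
        frontier = next_frontier - found
    return found


def relation_pair_features(graph, nodes):
    """Binary features: the adjacent relation pairs present in the subgraph."""
    return {tuple(sorted((a, b))) for a in nodes for b in graph[a] if b in nodes}


# Toy relation subgraph: a chain a - b - c plus an isolated relation d.
toy_rg = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
candidates = connected_subgraphs(toy_rg, max_nodes=3)
pairs = relation_pair_features(toy_rg, frozenset("abc"))
print(len(candidates), sorted(pairs))
```

Each candidate node set would then be turned into a feature vector (shape, relevance scores, pairs, triplets) and scored by the logistic regression model.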
In training, we integrate $RG^q_i$ with labeled topic entities to build a query graph, execute it to get the answer, and measure the answer F1 score. We generate training data for logistic regression by including all positive subgraphs with high F1 scores and randomly sampling 30 negative subgraphs. We turn F1 scores into binary training labels based on a threshold of 0.9. Once the logistic regression model is trained using the extracted features, we use it to score and rank the $RG^q_i$. After selecting the top 50 $RG^q_i$, we couple them with topic entities $e_{topic}$ to build query graph candidates $G^q_j$. Each $RG^q_i$ can be mapped to multiple $G^q_j$ by inserting different topic entities at different positions and assigning different relation directions (see an example of $G^q_j$ in Figure 1c). After this step, each generated $G^q_j$ contains only one topic entity. To build a query graph for a question with multiple topic entities, we merge the generated query graphs with similar relation structures but different entities. For example, $e_{ung}$, where r1 and r2 correspond to two different relations.
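The labeling step can be illustrated with a short sketch: compute the answer-set F1 of each candidate subgraph, binarize it at 0.9, keep all positives, and sample up to 30 negatives. Whether the threshold is inclusive, and that sampling is done per question, are assumptions here; the helper names and toy data are hypothetical.

```python
import random


def answer_f1(predicted, gold):
    """F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def build_training_set(candidates, gold, threshold=0.9, num_negatives=30, seed=0):
    """Turn executed candidates into (subgraph_id, binary label) pairs.

    candidates: {subgraph_id: answers returned by executing its query graph}.
    Assumption: F1 >= threshold counts as positive.
    """
    scored = {sid: answer_f1(ans, gold) for sid, ans in candidates.items()}
    positives = [sid for sid, f1 in scored.items() if f1 >= threshold]
    negatives = [sid for sid, f1 in scored.items() if f1 < threshold]
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(num_negatives, len(negatives)))
    return [(sid, 1) for sid in positives] + [(sid, 0) for sid in sampled]


toy_candidates = {
    "sub_1": ["x", "y"],       # exact match with gold
    "sub_2": ["x", "y", "z"],  # one spurious answer
    "sub_3": ["w"],            # no overlap
}
training_pairs = build_training_set(toy_candidates, gold=["x", "y"], num_negatives=1)
print(training_pairs)
```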
Query Graph Ranker: we rank the candidate query graphs $G^q_j$, $j = 1, 2, \cdots$, and use the top one, $G^q_{best}$, to produce the final answer to the input question $q$. Since there are only a small number of generated candidate query graphs, we can afford to evaluate them with a powerful and expensive model. We use Albert to compute the matching score between $G^q_j$ and $q$. To represent $G^q_j$ as a sequence of tokens and concatenate it with $q$, we locate the source node and concatenate all paths starting from the source node to make a sequence. The example in Figure 1c has source node $e_0$ and two paths that point to $e_{topic}$ and $e_2$, respectively. We represent the ungrounded entities as their entity types. The concatenation is then sent to the Albert model to get a matching score.
Table 1: Mappings from keywords to SPARQL constraints.
Keywords | SPARQL constraints
prior, before | FILTER (var1 < "var2"^^xsd:dateTime)
after | FILTER (var1 > "var2"^^xsd:dateTime)
less | FILTER (xsd:integer(var1) < var2)
greater, more | FILTER (xsd:integer(var1) > var2)
earliest, smallest, first | ORDER BY var1 LIMIT 1
largest, most, last | ORDER BY DESC var1 LIMIT 1
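The serialization used by the query graph ranker (locate the source node, then concatenate all paths that start from it) can be sketched as follows. Several details are assumptions made only for illustration: the source node is taken to be the node with no incoming edge, the graph is assumed to be acyclic, and the "[SEP]" separator, the relation names, and the type names are hypothetical.

```python
def serialize_query_graph(edges, entity_types, question):
    """Flatten a query graph into one token sequence appended to the question.

    edges: directed (head, relation, tail) triples; assumed acyclic.
    entity_types: {ungrounded entity: entity type}; grounded entities keep their name.
    """
    children, has_incoming = {}, set()
    for head, rel, tail in edges:
        children.setdefault(head, []).append((rel, tail))
        has_incoming.add(tail)
    # Assumption: the source node is the one without incoming edges.
    source = next(node for node in children if node not in has_incoming)

    def label(node):
        return entity_types.get(node, node)

    paths = []

    def walk(node, tokens):
        if node not in children:  # reached the end of a path
            paths.append(tokens)
            return
        for rel, tail in children[node]:
            walk(tail, tokens + [rel, label(tail)])

    walk(source, [label(source)])
    return question + " [SEP] " + " [SEP] ".join(" ".join(p) for p in paths)


# Mirrors the described example: source e0 with one path to the topic entity
# and one path to e2.
toy_edges = [("e0", "r1", "Swiss Psalm"), ("e0", "r2", "e2")]
toy_types = {"e0": "type_of_e0", "e2": "type_of_e2"}
sequence = serialize_query_graph(toy_edges, toy_types, "toy question ?")
print(sequence)
```

The resulting string is what a sequence-pair model would receive; tokenization and scoring are left out of this sketch.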
Following previous work (Luo et al., 2018), we further augment $G^q_{best}$ with constraints based on a set of predefined rules. This is necessary for detecting time and number constraints in superlative and comparative questions. The rules consist of mappings from keywords to SPARQL constraints, as shown in Table 1. Take as an example the input question "Who were the presidents of the United States before 2020?" and the predicted query graph "SELECT distinct * from <cwq> WHERE { <US> <president> ?e . }". The model first detects the keyword "before" in the question, and then learns that "var1" and "var2" are "?e" and "2020" based on the question and the predicted graph. Finally, it couples the generated SPARQL constraint with the predicted query to generate the final query.
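The keyword rules lend themselves to a small lookup table. The sketch below encodes the Table 1 mappings and applies one to the example query; it is a hedged illustration in which the variable bindings are passed in by hand (the text says the model learns them), the helper names are hypothetical, and parentheses are added around the DESC argument so that the output is valid SPARQL.

```python
import re

# Keyword groups and constraint templates, following Table 1.
CONSTRAINT_RULES = [
    (("prior", "before"), 'FILTER ({var1} < "{var2}"^^xsd:dateTime)'),
    (("after",), 'FILTER ({var1} > "{var2}"^^xsd:dateTime)'),
    (("less",), "FILTER (xsd:integer({var1}) < {var2})"),
    (("greater", "more"), "FILTER (xsd:integer({var1}) > {var2})"),
    (("earliest", "smallest", "first"), "ORDER BY {var1} LIMIT 1"),
    # Assumption: parentheses added so that DESC(...) parses as SPARQL.
    (("largest", "most", "last"), "ORDER BY DESC({var1}) LIMIT 1"),
]


def detect_constraint(question):
    """Return the template of the first rule whose keyword occurs in the question."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    for keywords, template in CONSTRAINT_RULES:
        if tokens & set(keywords):
            return template
    return None


def add_constraint(query, template, var1, var2=""):
    """Attach the instantiated constraint to a SPARQL query string."""
    constraint = template.format(var1=var1, var2=var2)
    if constraint.startswith("FILTER"):
        # FILTER goes inside the WHERE block, before its closing brace.
        head, _, tail = query.rpartition("}")
        return f"{head}{constraint} }}{tail}"
    # ORDER BY ... LIMIT is a solution modifier and follows the WHERE block.
    return f"{query} {constraint}"


question = "Who were the presidents of the United States before 2020?"
query = "SELECT distinct * from <cwq> WHERE { <US> <president> ?e . }"
final_query = add_constraint(query, detect_constraint(question), "?e", "2020")
print(final_query)
```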
Then, we execute the query against the KB to obtain the answer. Because all nodes in $G^q_{best}$ except the grounded entities and CVT nodes can potentially be the answer node, in the last step we resolve the answer node using a simple heuristic: we compare $G^q_{best}$ with all annotated query graphs in the training set to select graphs which are isomorphic to $G^q_{best}$. From the selected graphs, we choose the node (e.g., the top left node) which has most often been the answer node in the dataset. This heuristic achieves over 90% accuracy in practice.
Table 2a: Comparison with state-of-the-art models on CWQ and WQSP.
Method | CWQ | WQSP
HR-BiLSTM (Yu et al., 2017) | 31.2 † | 62.3 †
GRAFT-Net (Sun et al., 2018) | 26.0 † | 62.8
KBQA-GST (Lan et al., 2019) | 36.5 | 67.9
TEXTRAY (Bhutani et al., 2019) | |

Results and Analysis
We conduct experiments on two popular multi-hop KBQA datasets, COMPLEXWEBQUESTION-1.1 (CWQ) (Talmor and Berant, 2018) and WEBQUESTIONSP (WQSP) (Yih et al., 2015). The CWQ dataset has 34,689 complex questions (2-5 hops), while the WQSP dataset contains 4,737 simple questions (1 or 2 hops). In this work, we use CWQ for the main evaluation because our method is designed for complex questions. Both datasets use Freebase (Google, 2013) as the supporting knowledge base. We implement our model using NETWORKX (Hagberg et al., 2008), PYTORCH-1.6.0 (Paszke et al., 2019), and Huggingface (Wolf et al., 2019). For entity linking, we take a union of AllenNLP (Gardner et al., 2017) and Stanford NER (Finkel et al., 2005) outputs in CWQ experiments and use S-MART (Yang and Chang, 2016) in WQSP experiments. We further build an uppercase detector to add uppercase words to the ensembling results. For entity type linking, we search for entity types in Freebase via two relations, ns:common.topic.notable_types and ns:type.object.name. For Albert training, we initialize the model with pre-trained weights and fine-tune it on the corresponding KBQA dataset for 5 epochs. The model has 12 layers, 4096 hidden dimensions, and 64 attention heads. We set the learning rate to 1e-5 and limit the maximum length of the input sequence to 128 tokens.

Experimental Results
Table 2a compares our method with state-of-the-art models. We adopt the F1 score between the predicted answer set and the ground truth answer set as our main evaluation metric. Experimental results show that our method outperforms existing methods on CWQ, while staying competitive on WQSP. We can see that most previous methods perform very well on WQSP but poorly on CWQ. This is because the "step-wise growing" methods have to restrict the search space in order to be tractable on complex question datasets, which causes good query graph candidates to be missed and ultimately hurts the performance on CWQ. In the query graph generation stage during training, the search space of previous methods is $\Theta(n^t)$ without beam search or $\Theta(tbn)$ with beam search, while ours is $\Theta\left(\binom{k}{t}\right)$, where $t$ is the maximum number of hops, $n$ is the average node degree in the KB, $b$ is the beam size, and $k$ is the number of nodes in $RG^q$. In practice, we have roughly $n = 70$, $3 \le b \le 8$, $k = 15$, and $t = 5$ on CWQ and $t = 2$ on WQSP. Our search space is not as restricted as that of the previous methods using beam search but is still tractable. On CWQ, our model took only 1 day to train, while the second-best model QGG (Lan and Jiang, 2020) took 2 weeks.
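Plugging the reported values into the three search-space expressions gives a feel for the orders of magnitude involved. The sketch below reads the third expression as the binomial coefficient "k choose t", which is an assumption about the notation, and takes the beam size at the upper end of its reported range; the function name is hypothetical and constant factors hidden by the Theta notation are ignored.

```python
import math


def search_space_sizes(n, b, k, t):
    """Evaluate the three search-space expressions for concrete values."""
    return {
        "step-wise, no beam search": n ** t,
        "step-wise, beam search": t * b * n,
        "shrinking (k choose t)": math.comb(k, t),  # assumed reading
    }


n, b, k = 70, 8, 15  # values reported in the text, with b at its maximum
cwq_sizes = search_space_sizes(n, b, k, t=5)
wqsp_sizes = search_space_sizes(n, b, k, t=2)
for name, size in cwq_sizes.items():
    print(f"CWQ  {name}: {size:,}")
for name, size in wqsp_sizes.items():
    print(f"WQSP {name}: {size:,}")
```

Under this reading, the CWQ figures are 1,680,700,000 without beam search, 2,800 with beam search, and 3,003 for the shrinking approach, which is in line with the remark that the search space is less restricted than with beam search while remaining tractable.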
To further disentangle the contributions of different factors in our method, we present an ablation test on CWQ in Table 2b. Among the four features used in the query graph generator, the (relation, entity type, relation) triplet feature has the biggest impact on the performance. Features such as shape and entity type are global features that capture important priors about the graph. Without them, performance drops in both the query generation and ranking stages. It is also necessary to extract paths to make a sequential input to Albert. If we instead simply concatenate triplet facts into a sequence, the performance drops by 1.4%, even though this saves the time needed to detect source nodes and paths. We also observe a significant performance boost when considering the top 5 predictions instead of just the top 1, implying that correct answers are still ranked high when they are not at rank 1. Table 2c shows the separate performance of each component and how the overall F1 score changes through the three stages. We show the cardinality of the output (# of predictions) in each stage. On CWQ, the entity linking models do not work well, as almost half of the questions in CWQ have multiple topic entities, which makes the linking task difficult. On WQSP, the bottleneck is extracting the relation graph $RG^q$, likely due to overfitting on small training data. By adding gold relations to $RG^q$ and using the same setup as the current WQSP experiment, we obtain an upper-bound F1 of 84.1% in the query graph generator and 73.3% F1 in the final output. This score is comparable to the state-of-the-art model.

Error Analysis
As shown in Table 2c, the entity linking model does not perform very well on the CWQ dataset. We note that most of the questions in CWQ contain multiple topic entities, which makes the prediction task more challenging than it is on WQSP. This is the main reason for the big performance gap between CWQ and WQSP. In addition, we notice several difficulties in performing entity linking on CWQ.
(1) Typo in the dataset: "Bill Clinton" is mistakenly spelled as "Bill Clnton". (2) Name is not unique: there is more than one "Michael Jordan" in the knowledge graph, and there is no automatic way to determine which "Michael Jordan" the question refers to. (3) Topic entity can be a generic word: in the question "What art movement do the artists who study perspective belong to?", the topic entity is "perspective", which is difficult to detect with a regular entity linking tool. (4) Disambiguation: similar to (2), the model needs to map the extracted words to the corresponding entities in the knowledge graph. Even if the entity is unique in the KB, it is not always easy to perform the mapping perfectly.
We see a significant F1 score drop on the CWQ dataset in the last stage (from 0.741 to 0.462). We take a closer look at failure cases and observe that the model has difficulty in distinguishing very similar relations. For example, for the question "Where did the subject of the movie 'I'm Not There' live?", the model predicts a graph with the relation "place_of_birth", while the graph with the correct relation "places_lived" is ranked second. This is because, in the training set, a similar question "In the film 'Lydia Bailey', where did the subject live?" is linked with the relation "place_of_birth", whereas the better relation "places_lived" is not annotated. This kind of issue could be alleviated by annotating more positive samples or encouraging the model to explore unlabeled data in training (Qin et al., 2020). A rather tricky issue is that the relation "ns:location.country.languages_spoken" is often mistakenly predicted as "ns:location.country.official_language", or the other way around. These two relations are represented by similar features in the embedding space and thus easily confuse the model. Specifically, they appear 1,707 and 825 times in the training set, and in more than half of the cases they are perfectly interchangeable or generate very close answers. To distinguish such similar relations, the model needs a large number of samples to learn the subtle difference between the two.

Conclusion
We propose a novel query graph generation method that gradually shrinks a KB to a desired query graph. Compared to previous approaches, our approach is more computationally efficient. Experiments show that our method delivers consistent performance on two KBQA datasets: it improves the state-of-the-art results by an absolute 5.8% in F1 on the multi-hop KBQA task CWQ and produces a competitive result on the single-hop / two-hop KBQA task WQSP. In contrast, while some baseline methods work somewhat better on simple single-hop / two-hop questions, their performance drops dramatically on complex multi-hop questions.