ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering

We present Retriever-Transducer-Checker (ReTraCk), a neural semantic parsing framework for large-scale knowledge base question answering (KBQA). ReTraCk is designed as a modular framework to maintain high flexibility. It includes a retriever that fetches relevant KB items efficiently, a transducer that generates logical forms with syntax correctness guarantees, and a checker that improves the transduction procedure. ReTraCk ranks first in overall performance on the GrailQA leaderboard and obtains highly competitive performance on the popular WebQuestionsSP benchmark. Our system can interact with users in a timely manner, demonstrating the efficiency of the proposed framework.


Introduction
Knowledge base question answering (KBQA) is an important task in natural language processing that aims to satisfy users' information needs based on factual information stored in knowledge bases. Over the years, it has attracted a great deal of research attention from academia and industry. Early KBQA systems were generally rule-based: they rely on predefined rules or templates to parse questions into logical forms (Cabrio et al., 2012; Abujabal et al., 2017), and suffer from coverage and scalability problems. Recently, researchers have focused more on neural semantic parsing approaches. These data-driven parsing methods (Yih et al., 2015; Jia and Liang, 2016; Dong and Lapata, 2016; Liang et al., 2017; Gu et al., 2021) significantly improve the state-of-the-art (SOTA) performance on KBQA tasks. Although various neural semantic parsing methods have been proposed for KBQA, few works investigate how to leverage the advantages of SOTA models to build a comprehensive system, or how to fit such a system to practical application requirements (e.g., balancing effectiveness and efficiency). To investigate this, we identify two key issues hindering the development of KBQA systems.
On the one hand, there is a lack of a generic and extensible framework for KBQA. For example, the popular SEMPRE toolkit (Berant et al., 2013) provides infrastructure for developing statistical semantic parsers for KBQA with rich features, but its performance and scalability are inferior to recent neural semantic parsing methods. The TRANX toolkit (Yin and Neubig, 2018) employs a transition-based neural semantic parser that models the logical form generation procedure as a sequence of tree-constructing actions under a grammar specification. However, TRANX does not include the essential retriever components used in grounding, and thus does not currently support KBQA.
On the other hand, recent neural semantic parsing methods mostly emphasize performance on benchmark datasets while neglecting the efficiency (speed) dimension. This limits our understanding of how the designed approaches fit into real applications. For example, the popular query graph generation methods generate and rank a set of query graphs (Yih et al., 2015; Maheshwari et al., 2019; Lan and Jiang, 2020). Since all query graph candidates keep in line with the knowledge base (KB) structure, these methods take full advantage of the KB. However, they suffer from poor efficiency due to the large number of candidates and heavy querying of the KB. To verify this, we performed a preliminary study on available SOTA models. According to our study, these models either have difficulty supporting interactive online services, or limit the candidate space to specific datasets, which makes them difficult to apply in practice.
To this end, we present ReTraCk, a practical framework for large-scale KBQA. We hope ReTraCk can help standardize the KBQA model design process and lower the barrier of entry for new practitioners. ReTraCk is designed with the following principles in mind: • Flexibility ReTraCk employs a modular architecture, which decouples the dependencies among components as much as possible to enable quick integration of novel components. For example, our system supports two different kinds of schema retrievers, namely the dense schema retriever and the neighbor schema retriever.
• Efficiency ReTraCk falls into the transduction family, which is fast during the generation process. Besides, we retrieve entities and relevant schema items (relations and types) in parallel by leveraging recent advances in entity linking (Orr et al., 2021) and dense retrieval (Wu et al., 2020; Karpukhin et al., 2020). Our system can interact with users in a timely manner, demonstrating the efficiency of the proposed ReTraCk framework.
• Effectiveness ReTraCk is designed to enhance the controllability of transduction-based methods at both the syntax level and the semantic level. It first employs a grammar-based decoder (Yin and Neubig, 2018) to guarantee syntax correctness. Then it leverages a checker to alleviate semantic inconsistency issues. Inspired by previous work, four checking mechanisms are proposed and implemented in the checker: instance-level checking (Liang et al., 2017), ontology-level checking (Chen et al., 2018), real execution (Wang et al., 2018), and the novel virtual execution. Experimental results verify the significant effectiveness of our proposed checker. Notably, the checker is also flexible enough to be easily extended with new mechanisms. Finally, ReTraCk achieves state-of-the-art performance on GrailQA and highly competitive performance on WebQuestionsSP.

ReTraCk Framework
Given an input question q, ReTraCk parses the question into a logical form which can be deterministically converted into a SPARQL query to retrieve answers from the knowledge base K. Generally K consists of two parts: an ontology O ⊆ T ×R×T , which defines the schema structure, and the fact triples F ⊆ E × R × (E ∪ T ∪ L). Here, T is the set of types, R is the set of relations, E is the set of entities, and L is the set of literals. As shown in Fig. 1, ReTraCk consists of three components: retriever, transducer and checker. The retriever consists of an entity linker, which links explicit entity mentions to corresponding entities, and a schema retriever, which retrieves relevant schema items (types and relations) mentioned either explicitly or implicitly in the question. Given the retrieved KB items (entities, types, and relations), the transducer employs a grammar-based decoder to generate the logical form with syntax correctness guarantees. Meanwhile, the transducer interacts with the checker to discourage generating programs that are semantically inconsistent with KB.
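The three-component decomposition can be sketched as a minimal pipeline. All names, interfaces, and return values below are hypothetical simplifications for illustration, not ReTraCk's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RetrieverOutput:
    entities: List[str]
    types: List[str]
    relations: List[str]

def retrieve(question: str) -> RetrieverOutput:
    # Placeholder: the real retriever runs entity linking and dense
    # schema retrieval in parallel over the knowledge base.
    return RetrieverOutput(entities=["m.04vd3"],
                           types=["tv.tv_program"],
                           relations=["tv.tv_program.number_of_episodes"])

def transduce(question: str, kb_items: RetrieverOutput,
              check: Callable[[str], bool]) -> str:
    # Placeholder: the real transducer decodes grammar actions, consulting
    # `check` to prune semantically inconsistent partial programs.
    candidates = [f"(JOIN {r} {e})" for r in kb_items.relations
                  for e in kb_items.entities]
    for lf in candidates:
        if check(lf):
            return lf
    return candidates[0]

def answer(question: str) -> str:
    kb_items = retrieve(question)
    # The checker is passed into the transducer so it can intervene
    # during decoding rather than only after decoding finishes.
    return transduce(question, kb_items, check=lambda lf: "JOIN" in lf)
```

The key design point is that the checker is injected into the transducer rather than run as a post-processing filter.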
To make ReTraCk more accessible and interpretable for end users, we build a user interface. As shown in Fig. 2, users can type a question in the text box. The interface then displays the retrieved KB items, a graph visualization of the predicted logical form, the generated SPARQL query, and the predicted answer(s). The schema items selected by our transducer are shaded. Besides, users can view more information about any KB item by clicking the adjacent "Detail" link. Next, we introduce each component in detail.

Retriever
Entity Linker The entity linker used in this work follows the entity linking pipeline described in Gu et al. (2021). It first detects entity mentions using a BERT-based NER system, then generates candidate entities along with their prior scores based on an alias map mined from the KB and FACC1 (Gabrilovich et al., 2013). For entity disambiguation, we implement a prior baseline that selects the most popular entity based on the prior score. Besides, we also implement an alternative model by leveraging BOOTLEG (Orr et al., 2021) enriched with the prior features. Due to space limitations, the model details and its comparison with the entity linker used in Gu et al. (2021) are provided in the Appendix.
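The prior baseline reduces to an argmax over the candidates' prior scores. A minimal sketch (the entity IDs and scores are made up for illustration):

```python
def prior_disambiguate(candidates):
    # candidates: (entity_id, prior_score) pairs for one mention; the
    # prior baseline simply selects the most popular candidate entity.
    return max(candidates, key=lambda c: c[1])[0]

best = prior_disambiguate([("m.04bmk", 0.12), ("m.04vd3", 0.71)])
```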
Schema Retriever As schema items are not always mentioned explicitly in the question, and their vocabulary is much smaller than that of entities, we leverage the dense retriever framework (Mazaré et al., 2018; Humeau et al., 2020; Wu et al., 2020) to obtain the related types and relations. To be specific, we train a bi-encoder architecture (Wu et al., 2020) such that related schema items are close to the question embedding. This architecture allows for fast real-time inference, as it is able to cache the encoded candidates.
We use two independent BERT-base encoders (Devlin et al., 2019) to represent the input question as e_q and a candidate schema item as e_s by extracting the top-layer representation of the [CLS] token. The matching score for each pair (q, s_i) is calculated by the dot product: score(q, s_i) = e_q · e_{s_i}. Given a question q, we retrieve the top-k schema items with the highest scores at inference time.
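Because the candidate embeddings can be cached, retrieval reduces to a matrix-vector product followed by a top-k sort. A minimal sketch, with random unit vectors standing in for the BERT [CLS] representations:

```python
import numpy as np

rng = np.random.default_rng(0)
schema_items = ["tv.tv_program",
                "tv.tv_episode_segment.subjects",
                "common.topic"]
# Candidate embeddings are computed once and cached offline.
cand_matrix = rng.normal(size=(len(schema_items), 8))
cand_matrix /= np.linalg.norm(cand_matrix, axis=1, keepdims=True)

def retrieve_topk(question_vec, k):
    scores = cand_matrix @ question_vec   # dot-product matching scores
    order = np.argsort(-scores)[:k]       # highest scores first
    return [schema_items[i] for i in order]

# A question embedded exactly at the first item retrieves that item first.
top = retrieve_topk(cand_matrix[0], k=2)
```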

Transducer
Following previous work (Guo et al., 2018, 2019), especially the s-expression design principle (Gu et al., 2021), we design a set of grammar rules for the logical form. As shown in Table 1, there are two kinds of grammars in our definition: knowledge-agnostic grammar and knowledge-specific grammar. To incorporate these predefined grammar rules, we introduce a question encoder and a grammar-based decoder (Liu et al., 2020).
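One hypothetical way to represent the two grammar families: knowledge-agnostic productions are fixed ahead of time, while knowledge-specific productions are instantiated per question from the retrieved KB items. The data structures below are illustrative, not ReTraCk's actual implementation:

```python
# Knowledge-agnostic productions are shared across all questions (cf. Table 1).
KNOWLEDGE_AGNOSTIC = {
    "set": ["join_ent(rel, ent)", "join_rel(rel, rel)", "argmax(set, rel)"],
}

def knowledge_specific(retrieved):
    # One production per retrieved KB item, e.g. ent -> m.04vd3.
    rules = {"ent": [], "rel": [], "type": []}
    for kind, item in retrieved:
        rules[kind].append(item)
    return rules

rules = knowledge_specific([
    ("ent", "m.04vd3"),
    ("rel", "tv.tv_episode_segment.subjects"),
    ("rel", "tv.tv_program.number_of_episodes"),
])
```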
Grammar-based Decoder Once the question representation is prepared, the grammar-based decoder produces the target logical form step by step with attention over the question. Our decoder regards each logical form as a structure and outputs its corresponding sequence of grammar rules (actions) a = (a_1, · · · , a_K); we use "grammar rule" and "action" interchangeably.
At each decoding step, a nonterminal (e.g., set) is expanded using one of its valid grammar rules. For example, at time step k, the decoder LSTM_D accepts the embedding of the previous output φ_a(a_{k−1}) as input and updates its hidden state as:

h^D_k = LSTM_D([φ_a(a_{k−1}); c_{k−1}], h^D_{k−1}),
where c k−1 is the context vector obtained by attending on each encoder hidden state h E i . As for φ a , it behaves differently for knowledge-agnostic grammar rules and knowledge-specific grammar rules. For knowledge-agnostic grammar rules, φ a returns a trainable global embedding. For knowledgespecific grammar rules, φ a returns its related KB item representation, obtained by averaging over all word representations.
When predicting a_k, the probability of selecting action γ follows a softmax over the valid actions at step k:

p(a_k = γ | a_{<k}, q) ∝ exp(φ_a(γ)^T h^D_k).

BERT Encoding Motivated by the success of pretrained language models on cross-domain text-to-SQL tasks (Hwang et al., 2019), we augment our model with BERT (Devlin et al., 2019). First, we concatenate the question with all retrieved KB items as input to BERT to strengthen the connection between them. Then, we replace the word embeddings mentioned above with deep contextual representations from the last layer of BERT for each question token and each KB item, respectively. When the total number of words in the retrieved KB items exceeds the maximum length constraint of BERT, we split the KB items into different blocks and encode each block together with the question separately (Gu et al., 2021).
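At each step, only grammar rules that can expand the current nonterminal receive probability mass. This can be sketched as a masked softmax (a simplified stand-in for the actual decoder computation):

```python
import math

def masked_softmax(scores, valid):
    # Probability mass only over grammar rules that can expand the
    # current nonterminal; invalid actions get probability zero.
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, valid)]
    z = sum(exps)
    return [e / z for e in exps]

# Three candidate actions; only the first two are valid expansions here.
probs = masked_softmax([2.0, 1.0, 3.0], [True, True, False])
```

Note that the invalid third action receives zero probability even though its raw score is the highest.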

Checker
Inspired by previous work (Liang et al., 2017;Chen et al., 2018;Wang et al., 2018), we design a pluggable module named checker to improve the decoding process by leveraging semantics of KB.
Instance-level Checking relies on KB linkage information at the instance level (i.e., entities and their connected relations), which means that instance-level checking only deals with cases where the current action is a child node of the action set→ join_ent(rel, ent) in the abstract syntax tree (AST). As illustrated in Fig. 4, when expanding the nonterminal ent, any retrieved KB entity yields a valid grammar rule such as ent→m.04bmk or ent→m.04vd3. However, only m.04vd3 can pass instance-level checking, since the other candidates do not share direct links with the decoded relation tv.tv_episode_segment.subjects.

Ontology-level Checking performs checking with the help of KB linkage information at the ontology level (i.e., types and bridging relations). Taking the right subtree in Fig. 4 as an example, when expanding the second rel, we employ ontology-level checking to determine its valid semantic scope. According to the semantics of the grammar rule set→ join_rel(rel_1, rel_2), the type set of the head entity in rel_2 must overlap with the type set of the tail entity in rel_1, by which the candidate rel→tv.tv_program.number_of_episodes is selected. Although ontology-level checking applies to more situations than instance-level checking, it is weaker in terms of checking effectiveness and requires high-coverage ontology constraints.
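The two static checks can be sketched with toy KB fragments. The dictionaries below are hypothetical stand-ins for Freebase lookups, and the relation `tv.tv_series_episode.series` is an invented example:

```python
# Instance-level linkage: entity -> relations it participates in.
ENTITY_RELATIONS = {
    "m.04vd3": {"tv.tv_episode_segment.subjects"},
    "m.04bmk": {"music.artist.genre"},
}
# Ontology-level linkage: head/tail type sets of each relation.
REL_HEAD_TYPES = {"tv.tv_program.number_of_episodes": {"tv.tv_program"}}
REL_TAIL_TYPES = {"tv.tv_series_episode.series": {"tv.tv_program"}}

def instance_check(entity, relation):
    # The candidate entity must share a direct link with the decoded relation.
    return relation in ENTITY_RELATIONS.get(entity, set())

def ontology_check(rel1, rel2):
    # For set -> join_rel(rel1, rel2): rel2's head type set must
    # overlap rel1's tail type set.
    return bool(REL_HEAD_TYPES.get(rel2, set())
                & REL_TAIL_TYPES.get(rel1, set()))
```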
Real Execution When decoding reaches the end, an action sequence can be converted into a logical form, and finally into a SPARQL query. As depicted in Fig. 4, real execution simply takes the final SPARQL query and tries to execute it over the KB. If the query cannot be executed successfully, or the result is empty, the corresponding action sequence does not meet the executability requirement. In practice, we use real execution to check all complete action sequence candidates produced by beam search, stopping as soon as one action sequence passes the check.
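The beam-level procedure can be sketched as follows, with a fake endpoint standing in for a real SPARQL service (the logical form strings are placeholders):

```python
def real_execution_check(candidates, execute):
    # Try candidates in beam order; accept the first whose query
    # executes successfully with a non-empty result.
    for lf in candidates:
        try:
            result = execute(lf)
        except RuntimeError:       # query failed to execute
            continue
        if result:                 # non-empty result set
            return lf, result
    return None, []

# Hypothetical stand-in for a SPARQL endpoint.
def fake_execute(lf):
    if lf == "(COUNT bad)":
        raise RuntimeError("malformed query")
    return ["42"] if lf == "(COUNT good)" else []

best, ans = real_execution_check(
    ["(COUNT bad)", "(COUNT empty)", "(COUNT good)"], fake_execute)
```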

Virtual Execution
The real execution cannot intervene in the middle of program generation, which can leave only low-quality candidates in the final beam (e.g., none of the candidates can be executed). Meanwhile, since real execution relies on SPARQL, it is relatively slow, as SPARQL queries are executed over a tremendous number of entities (e.g., millions) with multi-hop relations. We therefore propose virtual execution to alleviate these issues. As illustrated in Fig. 4, virtual execution checks each completed sub-program during decoding without issuing real SPARQL queries (see Algorithm 1).

Following previous work, we use F1 and Hits@1 as evaluation metrics on WebQSP.

Implementation Details
We implemented our model based on PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018). With respect to BERT, we utilize the uncased BERT-base model from the Transformers library (Wolf et al., 2020). In training, we employed the Adam optimizer (Kingma and Ba, 2015). The learning rate is set to 1e-3, except for BERT, where it is set to 2e-5. Training our model on a single Tesla V100 takes approximately 20 hours. As for the dense retriever, on the GrailQA dataset we retrieve the top-100 type items and top-150 relation items; on the WebQSP dataset, the top-200 type items and top-500 relation items.

Baseline Models
We compare our model with previous state-of-the-art models on GrailQA (Lan and Jiang, 2020; Gu et al., 2021) and WebQSP (Liang et al., 2017; Sun et al., 2019; Saxena et al., 2020; Lan and Jiang, 2020). Notably, both the TRANSDUCTION and RANKING models proposed by Gu et al. (2021) for GrailQA can be based on either GloVe (Pennington et al., 2014) or BERT (Devlin et al., 2019). We compare with them under all settings.

Results
We test ReTraCk with two configurations, with or without the checker. As shown in Table 2, ReTraCk significantly outperforms the previous SOTA model BERT + RANKING (F1 +7.3, EM +7.5) and achieves a large improvement (F1 +28.5, EM +24.8) over the previous best transduction-based model BERT + TRANSDUCTION on GrailQA. Table 3 shows model performance on WebQSP. Given predicted entities, our model outperforms previous models (except for QGG (Lan and Jiang, 2020)) and even outperforms GRAFT-Net, PullNet, and EmbedKGQA when those models use oracle entities. Given oracle entities, the performance of our model further improves to 74.7 F1, which shows the potential gains from a better entity linker.
While most SOTA models constrain their answer space by assuming a fixed number of hops, we conduct experiments on both datasets without such assumptions, which better simulates real-world scenarios. QGG works well on WebQSP by accessing the KB via SPARQL when generating the query graph at each step. However, as noted in Gu et al. (2021), extending QGG to consider 3-hop relations on GrailQA would take a few months to train, which is prohibitively time-consuming, and it works poorly on GrailQA under a 2-hop assumption. Removing the checker module drops performance by 21.1 and 14.1 F1 points on GrailQA and WebQSP respectively, which demonstrates the significant effectiveness of the checker. Besides QGG, the GrailQA RANKING model takes an average of 115.5 seconds to process one query (according to data from https://github.com/dki-lab/GrailQA), which is not suitable for online systems. In contrast, ReTraCk takes only 1.62 seconds per query on average in its current implementation, demonstrating its efficiency.

Case Study
To demonstrate ReTraCk's capability, we show three typical examples from the development set of the GrailQA dataset in Table 4. In the first case, ReTraCk accurately links two mentions (don slater and editor in chief) in the query to the corresponding entities (m.05ws_t6 and m.02wk2cy) in Freebase. It also retrieves all necessary schema items (three relations and one type) via the schema retriever. The transducer equipped with the checker accurately understands the meaning of the query and composes a complex logical form with five operators. The predicted logical form is exactly the same as the golden logical form. As for the second case, ReTraCk parses the query into a logical form that is semantically equivalent to the golden logical form, which demonstrates the existence of program aliases. As for the third case, ReTraCk ignores the semantics conveyed by the word surface in the query, and selects the wrong schema item unit_of_density instead of unit_of_surface_density. This example shows that our model sometimes captures only part of the semantics of the query and misses some span information.

Conclusion
We present ReTraCk, a semantic parsing framework for KBQA. ReTraCk is flexible and efficient, achieving strong results on two distinct KBQA datasets. We hope that ReTraCk will be beneficial for future research efforts towards developing better KBQA systems.

A Entity Linker
The entity linker used in this paper follows the typical pipeline consisting of three sub-modules: mention detection, candidate generation, and entity disambiguation. Following previous work (Gu et al., 2021), we use a BERT-based NER system to detect entity mentions and literals (e.g., numerical values and datetimes) in the question. Then we generate candidate entities along with their prior probabilities using an alias map mined from the KB and FACC1 (Gabrilovich et al., 2013), a large entity linking corpus. For entity disambiguation, we adopt the state-of-the-art neural entity disambiguation model BOOTLEG (Orr et al., 2021), which shows decent generalization performance on long-tail entities. In BOOTLEG, each entity is represented with three levels of information: its unique entity embedding, the embeddings of its attached types, and the embeddings of its relations; BERT (Devlin et al., 2019) is leveraged to encode the context. Besides, we combine the prior score from the candidate generation step and the context compatibility score from BOOTLEG with two fully connected layers of 100 hidden units and ReLU non-linearities. Note that existing KBQA datasets do not provide mention boundary annotations. We generate distantly supervised training data for both named entity recognition and entity disambiguation by aligning the natural language question with each entity's observed aliases mined in the candidate generation step.
We evaluate the performance of our entity linker on the GrailQA dev set and the WebQSP test set. We compare it with the following baselines: 1) Aqqu (Bast and Haussmann, 2015), a rule-based entity linker using linguistic and entity popularity features. 2) GrailQA (Gu et al., 2021), a prior baseline. 3) Prior, a prior baseline implemented by us. 4) BOOTLEG (Orr et al., 2021), trained using distantly aligned question answering data. 5) BOOTLEG + Prior, the full disambiguation model used in this paper.
As shown in Table 5, our Prior performs slightly better than the Prior of GrailQA (Gu et al., 2021) by 0.8 F1 points on GrailQA. Interestingly, BOOTLEG trained with GrailQA data is even inferior to the Prior baseline by 4.8 F1 points. However, BOOTLEG + Prior improves over BOOTLEG and Prior by 9.2 and 4.4 F1 points, respectively. These results show that the prior feature is very important and orthogonal to the BOOTLEG model for question entity linking. As shown in Table 6, similar conclusions can be drawn from the experimental results on the WebQSP dataset. Compared with the experiments on GrailQA, the performance of BOOTLEG is lower, with only a 58.5 F1 score, and the improvement of BOOTLEG + Prior over Prior is reduced by 1.7 F1 points. This is mainly because the training set of WebQSP (3,098 instances) is much smaller than that of GrailQA (44,337 instances), which limits the learning of the BOOTLEG model.

B Dense Schema Retriever
In principle, the encoders can be implemented by any neural networks (Karpukhin et al., 2020). We use two independent BERT-base encoders (Devlin et al., 2019).
Training The goal of training the encoders is to create a vector space in which relevant schema items score higher against the given question. For each pair of question and schema item (q_i, s_i) in a batch of size B, the loss is computed as the negative log-likelihood of the positive pair:

L(q_i, s_i) = −log [ exp(score(q_i, s_i)) / Σ_{j=1}^{B} exp(score(q_i, s_j)) ].

In-batch negatives have been shown to be effective for learning a bi-encoder architecture (Karpukhin et al., 2020). To use in-batch negatives, we separate relevant schema items of the same question into different mini-batches. In this way, there are B training instances in each batch and B − 1 negative candidates for each question.
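The in-batch objective above can be written out in a few lines. This is a plain-Python sketch of the standard formulation (Karpukhin et al., 2020), not ReTraCk's actual training code:

```python
import math

def in_batch_negative_loss(scores):
    # scores[i][j] = score(q_i, s_j); the diagonal holds the positive pairs,
    # every off-diagonal entry is an in-batch negative.
    total = 0.0
    for i, row in enumerate(scores):
        z = sum(math.exp(s) for s in row)
        total += -math.log(math.exp(row[i]) / z)
    return total / len(scores)

# A well-separated batch: positives score high, in-batch negatives low,
# so the loss is close to zero.
loss = in_batch_negative_loss([[10.0, 0.0], [0.0, 10.0]])
```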
Dense Schema Retriever vs. Neighbor Schema Retriever To prune the decoding vocabulary space, Gu et al. (2021) retrieve schema items that are reachable from anchor entities within 2 hops in the KB; we refer to this as the neighbor schema retriever. In this section, we compare the performance of the dense schema retriever proposed in this work with the neighbor schema retriever. Fig. 5 shows the recall of schema items with respect to the top-k retrieved candidates on the GrailQA dev set. The neighbor schema retriever obtains 69.2% type recall with an average of 112.1 candidate items, while the dense schema retriever achieves 73.3% recall with only 2 candidates and 98.5% recall with 100 candidates. Similar trends can be found in the relation recall curve in Fig. 5. The dense schema retriever not only improves the recall of schema items, but also reduces the candidate size, which benefits the downstream transducer model.
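The recall@k metric plotted in Fig. 5 can be computed as follows; this is a sketch under an assumed averaging convention (per-question gold recall, averaged over questions):

```python
def recall_at_k(retrieved, gold, k):
    # retrieved: per-question ranked candidate lists; gold: per-question
    # gold schema items. Returns mean fraction of gold items in the top-k.
    total = 0.0
    for cands, g in zip(retrieved, gold):
        hit = len(set(cands[:k]) & set(g))
        total += hit / len(g)
    return total / len(retrieved)

# With k=2 the single gold item is covered; with k=1 it is missed.
r = recall_at_k([["type.a", "type.b", "type.c"]], [["type.b"]], k=2)
```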

C Checking Procedure
The usage of the four checking functions (instance-level checking, ontology-level checking, virtual execution, and real execution) is explained in the main paper. Here we present Algorithm 1 to describe the checking procedure in more detail.

D Detailed Hyper-parameter Setting
Entity Linker For the BERT-based NER model, we use the uncased BERT-base model from the Transformers library, trained with the AdamW optimizer (learning rate: 5e-5) for 5 epochs. For the entity disambiguation model, we use the default parameters from BOOTLEG. On the GrailQA dataset, we use the uncased BERT-base model trained with the SparseDenseAdam optimizer implemented in BOOTLEG (learning rate: 1e-4) for 5 epochs. We add two fully connected layers of 100 hidden units and ReLU non-linearities to combine the BOOTLEG and prior score features. The entity embedding size is set to 256; the type and relation embedding sizes are set to 128. The entity embedding mask percentage is set to 0.8. On the smaller WebQSP dataset, everything is the same as on GrailQA, except that we train for a larger number of epochs (50) and set the embedding size to 64 to avoid overfitting. Throughout our experiments, we select the best model based on the F1 score on the dev set of each dataset. We pass the top-3 and top-5 candidate entities per entity mention to the downstream transducer model on the GrailQA and WebQSP datasets, respectively.

Algorithm 1 Checking Process
Input: valid action candidates C, decoded logical form beam L, knowledge base K
Output: logical form beam for the next step L̂

Procedure static_checking(C, L, K):
    L̂ = ∅
    for each action sequence s in L do
        for each valid action candidate c in C do
            if not instance_checking(s, c) then continue
            if not ontology_checking(s, c) then continue
            ▷ novel checking techniques can be added here
            ŝ ← (s_1, s_2, · · · , s_|s|, c)
            L̂ ← L̂ ∪ {ŝ}
    L̂ = kbest_beam(L̂, k)    ▷ keep the top-k scoring candidates in L̂

Procedure dynamic_checking(L̂):
    for each action sequence ŝ in L̂ do
        τ = ŝ_|ŝ|
        while τ corresponds to a full sub-program do
            r = virtual_execution(τ)
            if not r then
                L̂ ← L̂ \ {ŝ}; break
            τ ← parent node of τ in the AST
        if ŝ has reached the end then
            r = real_execution(ŝ)
            if r then
                L̂ ← {ŝ}; break    ▷ only keep the first executable ŝ
    return L̂
Dense Schema Retriever We use the uncased BERT-base model from the Transformers library, trained with the AdamW optimizer (learning rate: 1e-5) for 10 epochs. We select the best model based on the recall of schema items on the dev set of each dataset. On the GrailQA dataset, we retrieve the top-100 type items and top-150 relation items. On the WebQSP dataset, we retrieve the top-200 type items and top-500 relation items.
Parser We implement our model based on PyTorch and AllenNLP. With respect to BERT, we use the uncased BERT-base model from the Transformers library. In training, we employ the Adam optimizer. The learning rate of our model is set to 1e-3, except for BERT, where it is set to 2e-5. Training our model on a single Tesla V100 takes approximately 20 hours. We select the best model based on the exact match ratio between the predicted logical form and the golden logical form.