Case-based Reasoning for Natural Language Queries over Knowledge Bases

It is often challenging to solve a complex problem from scratch, but much easier if we can access other similar problems with their solutions — a paradigm known as case-based reasoning (CBR). We propose a neuro-symbolic CBR approach (CBR-KBQA) for question answering over large knowledge bases. CBR-KBQA consists of a nonparametric memory that stores cases (question and logical form pairs) and a parametric model that can generate a logical form for a new question by retrieving cases that are relevant to it. On several KBQA datasets that contain complex questions, CBR-KBQA achieves competitive performance. For example, on the CWQ dataset, CBR-KBQA outperforms the current state of the art by 11% in accuracy. Furthermore, we show that CBR-KBQA is capable of using new cases without any further training: by incorporating a few human-labeled examples in the case memory, CBR-KBQA is able to successfully generate logical forms containing unseen KB entities as well as relations.


Introduction
Humans often solve a new problem by recollecting and adapting the solutions to multiple related problems that they have encountered in the past (Ross, 1984; Lancaster and Kolodner, 1987; Schmidt et al., 1990). In classical artificial intelligence (AI), case-based reasoning (CBR), pioneered by Schank (1982), tries to incorporate such a model of reasoning in AI systems (Kolodner, 1983; Rissland, 1983; Leake, 1996). A sketch of a CBR system (Aamodt and Plaza, 1994) comprises (i) a retrieval module, in which 'cases' that are similar to the given problem are retrieved, and (ii) a reuse module, where the solutions of the retrieved cases are reused to synthesize a new solution. Often, the new solution does not fully work and needs further revision, which is handled by (iii) a revise module.
In its early days, the components of CBR were implemented with symbolic systems, which had their limitations. For example, finding similar cases and synthesizing new solutions from them is challenging for a CBR system implemented with symbolic components. However, with recent advancements in representation learning (LeCun et al., 2015), the performance of ML systems has improved substantially on a range of practical tasks.
Given a query, CBR-KBQA uses a neural retriever to retrieve other similar queries (and their logical forms) from a case memory (e.g. the training set). Next, CBR-KBQA generates a logical form for the given query by learning to reuse various components of the logical forms of the retrieved cases. However, the generated logical form often does not produce the right answer when executed against a knowledge base (KB). This can happen because one or more of the needed KB relations is not present in the retrieved cases, or because KBs are woefully incomplete (Min et al., 2013) (Figure 1). To handle such cases, CBR-KBQA has an additional revise step that aligns the relations generated in the logical form to the local neighborhood of the query entities in the KB. To achieve this, we take advantage of pre-trained relation embeddings from KB completion techniques (e.g. TransE (Bordes et al., 2013)) that learn the structure of the KB.
It has been shown that neural seq2seq models do not generalize well to novel combinations of previously seen input (Lake and Baroni, 2018; Loula et al., 2018). However, CBR-KBQA has the ability to reuse relations from multiple retrieved cases, even if each case contains only partial logic for answering the query. We show that CBR-KBQA is effective for questions that need novel combinations of KB relations, achieving competitive results on multiple KBQA benchmarks such as WebQuestionsSP (Yih et al., 2016), ComplexWebQuestions (CWQ) (Talmor and Berant, 2018) and CompositionalFreebaseQuestions (CFQ) (Keysers et al., 2020). For example, on the hidden test set of the challenging CWQ dataset, CBR-KBQA outperforms the best system by over 11 points.
We further demonstrate that CBR-KBQA, without any further fine-tuning, also generalizes to queries that need relations never seen in the training set. This is possible due to CBR-KBQA's nonparametric approach, which allows one to inject relevant simple cases during inference, enabling the model to reuse new relations from those cases. In a controlled human-in-the-loop experiment, we show that CBR-KBQA can correctly answer such questions when an expert (e.g. a database administrator) injects a few simple cases into the case memory. CBR-KBQA is able to retrieve those examples from the memory and use the unseen relations to compose new logical forms for the given query.
Generalization to unseen KB relations, without any re-training, is out of scope for current neural models. The popular approach to handle such cases is to re-train or fine-tune the model on new examples. This process is not only time-consuming and laborious, but models also suffer from catastrophic forgetting (Hinton and Plaut, 1987; Kirkpatrick et al., 2017), making wrong predictions on examples which they previously predicted correctly. We believe that the controllable properties of CBR-KBQA are essential for QA models to be deployed in real-world settings and hope that our work will inspire further research in this direction.
Recent works such as REALM (Guu et al., 2020) and RAG (Lewis et al., 2020b) retrieve relevant paragraphs from a nonparametric memory for answering questions. CBR-KBQA, in contrast, retrieves queries similar to the input query and uses the relational similarity between their logical forms to derive a logical form for the new query. CBR-KBQA is also similar to the recent retrieve-and-edit framework (Hashimoto et al., 2018) for generating structured output. However, unlike us, they condition on only a single retrieved example and hence are unlikely to be able to handle complex questions that need reuse of partial logic from multiple questions. Moreover, unlike CBR-KBQA, retrieve-and-edit does not have a component that can explicitly revise an initially generated output.
The contributions of our paper are as follows: (a) We present a neural CBR approach for KBQA capable of generating complex logical forms conditioned on similar retrieved questions and their logical forms. (b) Since CBR-KBQA explicitly learns to reuse cases, we show it is able to generalize to unseen relations at test time when relevant cases are provided. (c) We also show the efficacy of the revise step of CBR-KBQA, which corrects the generated output by aligning it to the local neighborhood of the query entities. (d) Lastly, we show that CBR-KBQA significantly outperforms other competitive models on several KBQA benchmarks.

Model
This section describes the implementation of the various modules of CBR-KBQA. In CBR, a case is defined as an abstract representation of a problem along with its solution. In our KBQA setting, a case is a natural language query paired with an executable logical form. The practical importance of KBQA has led to the creation of an array of recent datasets (Zelle and Mooney, 1996; Bordes et al., 2015; Su et al., 2016; Yih et al., 2016; Zhong et al., 2017a; Ngomo, 2018; Yu et al., 2018; Talmor and Berant, 2018, inter alia). In these datasets, a question is paired with an executable logical form such as SPARQL, SQL, S-expression or graph query. All of these forms have equal representational capacity and are interchangeable (Su et al., 2016). Figure 2 shows an example of two equivalent logical forms. For our experiments, we consider SPARQL programs as our logical form.

Formal definition of task: let q be a natural language query and let K be a symbolic KB that needs to be queried to retrieve an answer list A containing the answer(s) for q. We also assume access to a training set D = {(q_1, ℓ_1), (q_2, ℓ_2), . . . , (q_N, ℓ_N)} of queries and their corresponding logical forms, where q_i and ℓ_i denote a query and its logical form, respectively. A logical form is an executable query containing entities, relations and free variables (Figure 2). CBR-KBQA first retrieves k similar cases D_q from D (§2.1). It then generates an intermediate logical form ℓ_inter by learning to reuse components of the logical forms of the retrieved cases (§2.2). Next, ℓ_inter is revised into the final logical form ℓ by aligning it to the relations present in the neighborhood subgraph of the query entity, recovering from any spurious relations generated in the reuse step (§2.3). Finally, ℓ is executed against K and the list of answer entities is returned. We evaluate our KBQA system by calculating the accuracy of the retrieved answer list w.r.t. a held-out set of queries.

Retrieve
The retrieval module computes a dense representation of the given query and uses it to retrieve other similar queries from the training set. Inspired by recent advances in neural dense passage retrieval (Das et al., 2019; Karpukhin et al., 2020), we use a ROBERTA-base encoder to encode each question independently. We want to retrieve questions that have high relational similarity rather than questions that merely share the same entities (e.g. we prefer to score the query pair (Who is Justin Bieber's brother?, Who is Rihanna's brother?) higher than (Who is Justin Bieber's brother?, Who is Justin Bieber's father?)). To minimize the effect of entities during retrieval, we use a named entity tagger to detect spans of entities and mask them with a [BLANK] symbol with probability p_mask during training. This entity-masking strategy has previously been used successfully to learn entity-independent relational representations (Soares et al., 2019). The similarity score between two queries is given by the inner product between their normalized vector representations (cosine similarity), where each representation, following standard practice (Guu et al., 2020), is obtained from the encoding of the initial [CLS] token of the query.
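As a concrete illustration, the entity-masking step can be sketched as follows; the function name and span format are our own choices for exposition, not the paper's implementation:

```python
import random

MASK = "[BLANK]"

def mask_entities(question, entity_spans, p_mask, rng=None):
    """Replace each detected entity span with [BLANK] with probability
    p_mask. entity_spans holds (start, end) character offsets, assumed
    sorted and non-overlapping, as produced by an NER tagger."""
    rng = rng or random.Random(0)
    out, cursor = [], 0
    for start, end in entity_spans:
        out.append(question[cursor:start])
        out.append(MASK if rng.random() < p_mask else question[start:end])
        cursor = end
    out.append(question[cursor:])
    return "".join(out)
```

With p_mask = 1, the query "Who is Justin Bieber's brother?" becomes "Who is [BLANK]'s brother?", so retrieval is driven by the relational phrase rather than the entity.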
Fine-tuning the question retriever: In passage retrieval, training data is gathered via distant supervision, in which passages containing the answer are marked as positive examples. Since in our setup we need to retrieve similar questions, we instead use the available logical forms as a source of distant supervision. Specifically, a question pair is weighted by the amount of overlap (w.r.t. KB relations) between their corresponding logical forms. Following DPR (Karpukhin et al., 2020), we ensure there is at least one positive example for each query during training and use a weighted negative log-likelihood loss, where the weights are computed by the F1 score between the sets of relations present in the corresponding logical forms. Concretely, let (q_1, q_2, . . . , q_B) denote all questions in a minibatch. For a question q_i, the loss function is:

L(q_i) = − Σ_{j≠i} w_{i,j} · log [ exp(sim(q_i, q_j)) / Σ_{k≠i} exp(sim(q_i, q_k)) ]

Here, q_i ∈ R^d denotes the vector representation of query q_i and sim(q_i, q_j) = q_i⊤ q_j. The weight w_{i,j} is computed as the F1 overlap between the relations in the logical forms of q_i and q_j. We pre-compute and cache the query representations of the training set D. For a query q, we return the top-k similar queries in D w.r.t. q and pass them to the reuse module.
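A minimal sketch of the distant-supervision weights and the per-question weighted loss; these are plain-Python stand-ins (our own names `relation_f1` and `weighted_nll`), not the paper's batched implementation:

```python
import math

def relation_f1(rels_a, rels_b):
    """F1 overlap between the relation sets of two logical forms,
    used as the distant-supervision weight w_ij."""
    a, b = set(rels_a), set(rels_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(b), overlap / len(a)
    return 2 * precision * recall / (precision + recall)

def weighted_nll(sims, weights):
    """Weighted negative log-likelihood over one row of in-batch
    similarity scores, sims[j] = sim(q_i, q_j) for j != i."""
    log_z = math.log(sum(math.exp(s) for s in sims))
    return -sum(w * (s - log_z) for s, w in zip(sims, weights))
```

A pair sharing all relations gets weight 1 and acts as a hard positive; a pair sharing no relations gets weight 0 and only contributes to the normalizer.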

Reuse
The reuse step generates an intermediate logical form from the k cases that are fed to it as input from the retriever module. Pre-trained encoder-decoder transformer models such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) have proven effective for conditional generation tasks. We take a similar approach in generating an intermediate logical form conditioned on the retrieved cases. However, one of the core limitations of transformer-based models is their quadratic memory dependency, due to full attention, which severely limits the sequence length they can operate on. For example, BART and T5 only support a sequence length of 512 tokens in their encoders. Recall that for us, a case is a query from the train set paired with an executable SPARQL program, which can be arbitrarily long.
To increase the number of input cases, we leverage a recently proposed sparse-attention transformer architecture -BIGBIRD (Zaheer et al., 2020). Instead of having each token attend to all input tokens as in a standard transformer, each token attends to only nearby tokens. Additionally, a small set of global tokens attend to all tokens in the input. This reduces the transformer's memory complexity from quadratic to linear, and empirically, BIGBIRD enables us to use many more cases.
Description of input: The input query q and the cases D_q = {(q_1, ℓ_1), (q_2, ℓ_2), . . . , (q_k, ℓ_k)} are concatenated on the encoder side as q [SEP] q_1 [SEP] ℓ_1 [SEP] . . . [SEP] q_k [SEP] ℓ_k, where [SEP] denotes the standard separator token. Each logical form also contains the KB entity id of each entity in the question (e.g. m.03_r3 for Jamaica in Figure 2). We append the entity id after the surface form of the entity mention in the question string. For example, the query in Figure 2 becomes "What do Jamaican m.03_r3 people speak?".
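The input construction can be sketched as below; `build_encoder_input` and `append_entity_ids` are illustrative names, and the single-space joining is a simplification of real tokenization:

```python
SEP = "[SEP]"

def build_encoder_input(query, cases):
    """Concatenate the query with the retrieved (question, logical_form)
    cases, separated by [SEP] tokens."""
    parts = [query]
    for question, logical_form in cases:
        parts.extend([question, logical_form])
    return f" {SEP} ".join(parts)

def append_entity_ids(question, links):
    """Append each linked KB entity id right after its surface mention,
    e.g. 'Jamaican' -> 'Jamaican m.03_r3'."""
    for mention, entity_id in links:
        question = question.replace(mention, f"{mention} {entity_id}", 1)
    return question
```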
Training is done using a standard seq2seq cross-entropy objective. Large deep neural networks usually benefit from "good" initialization points (Frankle and Carbin, 2019), and being able to utilize pre-trained weights is critical for seq2seq models. We find it helpful to add a regularization term that minimizes the Kullback-Leibler divergence (KLD) between the output softmax distributions obtained (1) when only the query q is presented (i.e. not using cases), and (2) when both the query and the cases D_q are available (Yu et al., 2013). Formally, let f be the seq2seq model, and let σ = softmax(f(q, D_q)) and σ′ = softmax(f(q)) be the decoder's prediction distributions with and without cases, respectively. The following KLD term is added to the seq2seq cross-entropy loss:

L_KLD = λ_T · KL(σ ∥ σ′)

where λ_T ∈ [0, 1] is a hyper-parameter. Intuitively, this term regularizes the prediction of f(q, D_q) not to deviate too far from that of f(q), and we found this to work better than initializing with a model trained without cases.
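For concreteness, the regularized objective can be sketched over explicit probability lists rather than tensors; `kl_divergence` and `regularized_loss` are our own illustrative names:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(ce_loss, probs_with_cases, probs_without_cases, lam):
    """Seq2seq cross-entropy plus the lambda-weighted KL term between the
    decoder's distributions with and without retrieved cases."""
    return ce_loss + lam * kl_divergence(probs_with_cases, probs_without_cases)
```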

Revise
In the previous step, the model explicitly reuses the relations present in D_q; nonetheless, there is no guarantee that the logical forms in D_q will contain the relations required to answer the original query q. This can happen when the domain of q and the domain of the cases in D_q are different, even when the relations are semantically similar. For example, in Figure 1, although the retrieved relations in the NN queries are semantically similar, there is a domain mismatch (person vs. fictional character). Similarly, large KBs are very incomplete (Min et al., 2013), so querying with a valid relation might require an edge that is missing in the KB, leading to intermediate logical forms which do not execute.
To alleviate this problem and make the queries executable, we explicitly align the generated relations with relations (edges) present in the local neighborhood of the query entity in the KB. We propose the following alignment models. Using pre-trained KB embeddings: KB completion is an extensively studied research field (Nickel et al., 2011; Bordes et al., 2013; Socher et al., 2013; Velickovic et al., 2018; Sun et al., 2019b), and several methods have been developed that learn low-dimensional representations of relations such that similar relations are close to each other in the embedding space. We take advantage of the pre-trained relation embeddings obtained from TransE (Bordes et al., 2013), a widely used model for KB completion. For each predicted relation, we find the most similar (outgoing or incoming) relation edge (in terms of cosine similarity) that exists in the KB for that entity and align with it. If the predicted edge exists in the KB, it trivially aligns with itself. There can be multiple missing edges that need alignment (Figure 1), and we find it more effective to do beam search instead of greedily matching the most similar edge at each step.
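Greedy alignment of a single predicted relation can be sketched as follows (the paper uses beam search over multiple edges; the names here are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_relation(predicted, kb_edges, embeddings):
    """Align a predicted relation to the most similar edge (by cosine
    similarity of pre-trained embeddings) that actually exists around
    the query entity; an existing edge trivially aligns with itself."""
    if predicted in kb_edges:
        return predicted
    return max(kb_edges, key=lambda rel: cosine(embeddings[predicted], embeddings[rel]))
```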
Using similarity in surface forms: Similar relations (even across domains) have overlap in their surface forms (e.g. 'siblings' is a common term in both 'person.siblings' and 'fictional_character.siblings'). Therefore, word embeddings obtained by encoding these strings should be similar. This observation has been successfully utilized in previous works (Toutanova and Chen, 2015; Hwang et al., 2019). We similarly encode the predicted relation and all the outgoing or incoming edges with a ROBERTA-base model. Following standard practice, each relation string is prepended with a [CLS] token, the word pieces are encoded with the model, and the output embedding of the [CLS] token is taken as the relation representation. Similarity between two relation representations is computed by cosine similarity.
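As a rough stand-in for the embedding-based scorer, even simple token overlap captures the intuition; `surface_overlap` is our own illustrative function, not the paper's ROBERTA-based similarity:

```python
def relation_tokens(rel):
    """Split a dotted Freebase-style relation name into its word tokens."""
    return set(rel.replace(".", "_").split("_"))

def surface_overlap(rel_a, rel_b):
    """Jaccard overlap of surface tokens between two relation names."""
    a, b = relation_tokens(rel_a), relation_tokens(rel_b)
    return len(a & b) / len(a | b)
```

Under this score, 'person.siblings' is closer to 'fictional_character.siblings' (which shares the token 'siblings') than to an unrelated relation.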
Our alignment is simple and requires no learning. By aligning only to individual edges in the KB, we ensure that we do not change the structure of the generated LF. We leave the exploration of learning to align single edges in the program to sequences of edges (paths) in the KB as future work.

Experiments
Data: For all our experiments, the underlying KB is full Freebase, containing over 45 million entities (nodes) and 3 billion facts (edges) (Bollacker et al., 2008). We test CBR-KBQA on three datasets: WebQuestionsSP, ComplexWebQuestions (CWQ) and CFQ.

Hyperparameters: All hyperparameters are set by tuning on the validation set for each dataset. We initialize our retriever with the pre-trained ROBERTA-base weights. We set p_mask = 0.2 for CWQ and 0.5 for the remaining datasets. We use a BIGBIRD generator network with 6 encoding and 6 decoding sparse-attention layers, which we initialize with pre-trained BART-base weights. We use k = 20 cases and decode with a beam size of 5. The initial learning rate is set to 5 × 10^−5 and is decayed linearly through training. Further details for the EMNLP reproducibility checklist are given in §A.2.

Entity Linking
The first step required to generate an executable LF for an NL query is to identify and link the entities present in the query. For our experiments, we use a combination of an off-the-shelf entity linker and a large mapping of surface forms to entities. For the off-the-shelf linker, we use ELQ, a recently proposed high-precision entity linker. To further improve the recall of our system, we first identify mention spans of entities in the question by tagging it with an NER system. Next, we link entities not linked by ELQ by exact matching with the surface forms annotated in the FACC1 project (Gabrilovich et al., 2013). Our entity linking results are shown in Table 2.
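The two-stage linking can be sketched as follows; the function and argument names are ours, and the real system operates on model outputs rather than plain dictionaries:

```python
def link_entities(question, elq_links, surface_form_map, mention_spans):
    """Combine a high-precision linker's output (elq_links: mention ->
    entity id) with exact surface-form lookups (a FACC1-style map) for
    mentions the first-stage linker missed."""
    linked = dict(elq_links)
    for start, end in mention_spans:
        mention = question[start:end]
        if mention not in linked and mention in surface_form_map:
            linked[mention] = surface_form_map[mention]
    return linked
```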

KBQA Results
This section reports the performance of CBR-KBQA on various benchmarks. We report the strict exact-match accuracy, where we compare the list of answers obtained by executing the generated SPARQL program to the list of gold answers. A question is answered correctly only if the two lists match exactly. We also report precision, recall and F1 score to be comparable to the baselines. Models such as GraftNet (Sun et al., 2018) and PullNet (Sun et al., 2019a) rank answer entities and return the top entity as the answer (Hits@1 in Table 1). This is undesirable for questions that have multiple entities as answers (e.g. "Name the countries bordering the U.S.?").
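The strict metric can be sketched as follows (illustrative helper names; answers are compared as sets, so order does not matter but every gold answer must be produced):

```python
def exact_match(predicted_answers, gold_answers):
    """Strict exact match: the executed answer list must equal the gold
    answer list as a set, with no partial credit."""
    return set(predicted_answers) == set(gold_answers)

def strict_accuracy(predictions, golds):
    """Fraction of questions whose full answer list matches exactly."""
    correct = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)
```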
We also report the performance of models that only depend on question-answer pairs during training and do not use LF supervision (the weakly-supervised setting). Unsurprisingly, models trained with explicit LF supervision perform better than weakly supervised models. Our main baseline is a massive pre-trained seq2seq model with orders of magnitude more parameters: T5-11B (Raffel et al., 2020). T5 has recently been shown to be effective for compositional KBQA (Furrer et al., 2020). For each dataset, we fine-tune the T5 model on the query and LF pairs. Table 1 reports results of various models on WebQSP. All reported models except CBR-KBQA and T5-11B directly operate on the KB (e.g. traverse KB paths starting from the query entity) to generate the LF or the answer. As a result, models such as STAGG tend to enjoy much higher recall. In contrast, much of our logical query is generated by reusing components of similar cases. We also report the results of 'aligning' the LF produced by T5 using our revise step. As shown in Table 1, CBR-KBQA outperforms all other models significantly and improves the strict exact-match accuracy by more than 6 points over the best model. The revise step also improves the performance of T5, suggesting that it is generally applicable. Table 3 reports performance on the hidden test set of CWQ, which was built by extending WebQSP questions with the goal of creating more complex multi-hop questions. (The result of our model on the official leaderboard (https://www.tau-nlp.org/compwebq-leaderboard) is higher than reported here (70.4 vs. 67.1) because the official evaluation script assigns full score if any of the correct answer entities is returned, even when a question has multiple correct answers; in this paper we report strict exact-match accuracy.) It is encouraging to see that CBR-KBQA outperforms all other baselines on this challenging dataset by a significant margin. Finally, we report results on CFQ in Table 4.
On error analysis, we found that for several yes/no-type questions, our model was predicting the list of matching entities instead of a yes or no. We therefore created a rule-based binary classifier that predicts the type of question (yes/no or other). If a question is predicted to be yes/no, we output yes if the length of the predicted answer list is greater than zero and no otherwise (if the model already predicted a yes/no, we keep the original answer unchanged). We report results on all three MCD splits of the dataset and compare with the T5-11B model of Furrer et al. (2020); our model outperforms T5-11B on this dataset as well.

It is encouraging to see that CBR-KBQA, despite containing orders of magnitude fewer parameters than T5-11B, outperforms it on all benchmarks, showing that it is possible for smaller models with a smaller carbon footprint and added reasoning capabilities to outperform massive pre-trained LMs. Table 5 shows that the revise step is useful for CBR-KBQA on multiple datasets. The T5 model also benefits from the alignment in the revise step, with more than 3 points improvement in F1 score on the CWQ dataset. We find that TransE alignment outperforms ROBERTA-based alignment, suggesting that graph-structure information is more useful than surface-form similarity for aligning relations. Moreover, relation names are usually short strings, so they do not provide enough context for LMs to form good representations. Next, we demonstrate an advantage of the nonparametric property of CBR-KBQA: the ability to fix an initially wrong prediction by allowing new cases to be injected into the case-memory. This allows CBR-KBQA to generalize to queries which need relations never seen during training. Due to space constraints, we report other results (e.g. retriever performance), ablations and further analysis in §B.
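The rule-based yes/no post-processing described above might look like the following sketch; the starter-word list is a guess at a plausible rule set, not the classifier actually used:

```python
YES_NO_PREFIXES = ("is ", "are ", "was ", "were ", "does ", "did ", "do ")

def is_yes_no_question(question):
    """Crude rule: treat a question as yes/no-type if it starts with an
    auxiliary verb."""
    return question.lower().startswith(YES_NO_PREFIXES)

def postprocess(question, predicted_answers):
    """Map a non-empty entity list to 'yes' and an empty one to 'no' for
    yes/no questions; keep an existing yes/no prediction unchanged."""
    if is_yes_no_question(question) and predicted_answers not in (["yes"], ["no"]):
        return ["yes"] if predicted_answers else ["no"]
    return predicted_answers
```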

Point-Fixes to Model Predictions
Modern QA systems built on top of large LMs do not provide us the opportunity to debug an erroneous prediction. The current approach is to fine-tune the model on new data. However, this process is time-consuming and impractical for production settings. Moreover, it has been shown (and as we will empirically demonstrate) that this approach leads to catastrophic forgetting, where the model forgets what it had learned before (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017). On the other hand, CBR-KBQA adopts a nonparametric approach and allows inspection of the retrieved nearest neighbors for a query. Moreover, one can inject a new relevant case into the case-memory (KNN index), which can be picked up by the retriever and used by the reuse module to fix an erroneous prediction.

Performance on Unseen Relations
We consider the case when the model generates a wrong LF for a given query. We create a controlled setup by removing all queries from the training set of WebQSP whose LFs contain the (people.person.education) relation. This removes 136 queries from the train set and ensures that the model fails to correctly answer the 86 held-out test queries which contain the removed relation in their LFs. We compare against a baseline transformer model that does not use cases. As shown in Table 6, both the baseline and CBR-KBQA perform poorly without any relevant cases, since a required KB relation was missing during training. Next, we add the 136 training instances back to the training set and recompute the KNN index. This involves encoding the newly added NL queries and rebuilding the index, which is computationally much cheaper than re-training the model. Row 5 in Table 6 shows the new result. On addition of the new cases, CBR-KBQA can seamlessly use them and copy the unseen relation to predict the correct LF, reaching 70.6% accuracy on the 86 held-out queries.
In contrast, the baseline transformer must be fine-tuned on the new cases to handle the new relation, which is computationally far more expensive than adding the cases to our index. Moreover, fine-tuning only on the new instances leads to catastrophic forgetting, as seen in row 2 of Table 6, where the baseline model's performance on the initial set decreases drastically. We find it necessary to carefully fine-tune the model on new examples alongside original training examples (in a 1:2 proportion). However, it still converges to a performance that is lower than its original performance and much lower than that of CBR-KBQA.

Human-in-the-Loop Experiment
During error analysis, we realized that the test set of WebQSP contains queries whose LFs include KB relations never seen during training. This means models will never be able to predict the correct LF for such queries because of the unseen relations. We conduct a human-in-the-loop experiment (Figure 3) in which, for each unseen relation, a few simple cases are added to the case memory, written by an expert or drawn from the SimpleQuestions dataset, where each case is an NL query paired with a program that contains only one KB relation. Table 14 in Appendix D shows various statistics of the missing relations and the number of cases added by humans and from SimpleQuestions. The cases are added to the original KNN index. By adding a few cases, the performance increases from 0 to 36 F1 (Table 7) without requiring any training. Note that, unlike the previous controlled experiment in §3.4.1, we add only around 3.87 cases per unseen relation. Importance of this result: We believe that the flexibility to fix a model's predictions without re-training is an important desideratum for QA systems deployed in production settings, and we hope our results will inspire future research in this direction.

Further Analysis
We analyze questions in the evaluation set which require novel combinations of relations never seen in the training set. (Recall that in §3.4.1 we added 136 cases for one relation, versus 3.87 here; this is why the accuracy in Table 6 is higher than in Table 7.) To answer these questions correctly, our model has to retrieve relevant nearest neighbor (NN) questions from the training set and copy the required relations from the logical forms of multiple NN queries. Table 8 shows that our model outperforms the competitive T5 baseline. As we saw in the last section, our model is also able to quickly adapt to relations never seen in the training set by picking them up from newly added cases. We also compare with a model that has the same reuse component as CBR-KBQA but is trained and tested without retrieving any cases from the case-memory (Table 9). Even though this baseline is competitive, having similar cases is beneficial, especially for the WebQSP dataset. We also report results when we use only the cross-entropy loss for training the BIGBIRD model, without the KL-divergence term. Table 10 reports the performance of CBR-KBQA using different numbers of retrieved cases. It is encouraging to see that performance improves with an increasing number of cases.

Related Work
Retrieval-augmented QA models (Chen et al., 2017; Guu et al., 2020; Lewis et al., 2020b) augment a reader model with a retriever that finds relevant paragraphs in a nonparametric memory. In contrast, our CBR approach retrieves similar queries and uses their logical forms to derive a new solution. Recently, Lewis et al. (2020c) proposed a model that finds the nearest neighbor (NN) question in the training set and returns the answer to that question. While this model is helpful when the exact question or a paraphrase is present in the training set, it does not generalize to other scenarios. CBR-KBQA, on the other hand, learns to reason with the programs of multiple retrieved NN queries and generates a new program for the given query, and hence is able to generalize even when no paraphrase of the query is present in the train set. Retrieve and edit: CBR-KBQA shares similarities with the RETRIEVE-AND-EDIT framework (Hashimoto et al., 2018), which utilizes a retrieved nearest neighbor for structured prediction. However, unlike our method, it retrieves only a single nearest neighbor and is unlikely to be able to generate programs for questions requiring relations from multiple nearest neighbors. Generalizing to unseen database schemas: There has been work in program synthesis that generates SQL programs for unseen database schemas (Wang et al., 2020; Lin et al., 2020). However, these works operate on web or Wikipedia tables with small schemas. For example, in WikiTableQuestions (Pasupat and Liang, 2015) the average number of columns in a table is 5.8, and in the Spider dataset (Yu et al., 2018) the average number of columns is 28.1. In contrast, our model has to consider all possible Freebase relations (in the thousands). Previous work performs schema-aware encoding, which is not possible in our case because of the large number of relations.
The retrieve step of CBR-KBQA can be seen as a pruning step that narrows the number of candidate relations by retrieving relevant questions and their logical forms. Case-based reasoning for KB completion: Recently, a CBR-based KB reasoning approach was proposed by Das et al. (2020a,b). They retrieve similar entities and then find KB reasoning paths from them. However, their approach does not handle complex natural language queries and only operates on structured triple queries. Additionally, the logical forms handled by our model have much more expressive power than knowledge base paths. Program synthesis and repair: Repairing or revising generated programs has been studied in the field of program synthesis. For example, prior work repairs a program based on the syntax of the underlying language (Le et al., 2017) or by generating sketches (Hua et al., 2018). More recently, Gupta et al. (2020) propose a framework in which a program debugger revises the program generated by a neural program synthesizer. However, none of these works take advantage of the similarity between semantic relations present in the knowledge base, and hence, unlike us, they do not use embeddings of similar relations to align relations. More generally, many prior efforts have employed neural models to generate SPARQL-like code for semantic parsing (Dong and Lapata, 2016; Balog et al., 2016; Zhong et al., 2017a), SQL queries over relational databases (Zhong et al., 2017b), program-structured neural network layouts (Andreas et al., 2016), and even proofs for mathematical theorems (Polu and Sutskever, 2020). Our work differs in its use of the programs of multiple retrieved similar queries to generate the target program. K-NN approaches in other NLP applications: Khandelwal et al. (2020) demonstrate improvements in language modeling by utilizing explicit examples from the training data.
There has been work in machine translation (Gu et al., 2018; Khandelwal et al., 2021) that uses nearest neighbor translation pairs to guide the decoding process. Recently, Hossain et al. (2020) proposed a retrieve-edit-rerank approach for text generation, in which each retrieved candidate from the training set is edited independently and then re-ranked. In contrast, CBR-KBQA generates the program jointly from all the retrieved cases and is more suitable for questions which need relations copied from multiple nearest neighbors. Please refer to §E for further related work.

Limitations and Future Work
To the best of our knowledge, we are the first to propose a neuralized CBR approach for KBQA. We showed that our model is effective in handling complex questions over KBs, but our work also has several limitations. First, our model relies on the availability of supervised logical forms such as SPARQL queries, which can be expensive to annotate at scale. In the future, we plan to explore ways to learn directly from question-answer pairs (Berant et al., 2013; Liang et al., 2016). Second, even though CBR-KBQA is modular and has several advantages, the retrieve and reuse components of our model are trained separately. In the future, we plan to explore avenues for end-to-end learning for CBR.

A.2 Hyperparameters
The WebQSP dataset does not contain a validation split, so we choose 300 training instances to form the validation set. We use grid search (unless explicitly mentioned) to set the hyperparameters listed below.

Case Retriever: We initialize our retriever with pre-trained ROBERTA-base weights. We set the initial learning rate to 5 × 10^−5 and decay it linearly throughout training. We evaluate the retriever based on the percentage of gold LF relations in the LFs of the top-k retrieved cases (recall@k). We train for 10 epochs and use the best checkpoint based on recall@20 on the validation set. We set train and validation batch sizes to 32. For p_mask, we try values from [0, 0.2, 0.4, 0.5, 0.7, 0.9, 1]. When training the retriever, we found p_mask = 0.2 works best for COMPLEXWEBQUESTIONS and p_mask = 0.5 for the remaining datasets.
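The recall@k criterion above can be sketched as follows. This is an illustrative implementation, not the paper's code; `gold_relations` is the set of relations in the gold logical form, and `retrieved_cases` is a list of relation sets from the ranked retrieved cases.

```python
def recall_at_k(gold_relations, retrieved_cases, k):
    """Fraction of gold logical-form relations that appear in the
    logical forms of the top-k retrieved cases."""
    covered = set()
    for case_relations in retrieved_cases[:k]:
        covered.update(case_relations)
    if not gold_relations:
        return 1.0
    hits = sum(1 for r in gold_relations if r in covered)
    return hits / len(gold_relations)

# Toy example: the top-2 cases cover two of the three gold relations.
gold = ["film.director", "film.release_year", "person.nationality"]
cases = [["film.director"],
         ["film.release_year", "film.genre"],
         ["person.nationality"]]
print(recall_at_k(gold, cases, k=2))  # 2/3
```

Averaging this quantity over the validation set gives the recall@20 used for checkpoint selection.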
Seq2Seq Generator: We use a BIGBIRD generator network with 6 encoding and 6 decoding sparse-attention layers, which we initialize with pre-trained BART-base weights. We set the initial learning rate to 5 × 10^−5 and decay it linearly throughout training. Accuracy after the execution of generated programs on the validation set is used to select the optimal setting and model checkpoint.

Dataset   Validation Acc
WebQSP    71.5
CWQ       82.8
CFQ       69.9

Table 12: Validation set accuracy of models corresponding to the results reported in the paper.
For λ_T, we perform random search in the range [0, 1]. We finally use λ_T = 1.0 for all datasets. For k (the number of cases), we search over the values [1, 3, 5, 7, 10, 20]. For all datasets, we use k = 20 cases and decode with a beam size of 5. The WebQSP model was trained for 15K gradient steps and all other models were trained for 40K gradient steps.
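The selection of k amounts to a one-dimensional grid search scored by execution accuracy on the validation set. A minimal sketch, where `exec_accuracy` is a hypothetical callable mapping a candidate k to validation accuracy (in practice, a full train/evaluate run):

```python
def select_k(candidate_ks, exec_accuracy):
    """Pick the number of retrieved cases k that maximizes
    execution accuracy on the validation set."""
    return max(candidate_ks, key=exec_accuracy)

# Toy accuracies standing in for actual validation runs.
toy_acc = {1: 0.60, 3: 0.65, 5: 0.68, 7: 0.69, 10: 0.70, 20: 0.72}
best_k = select_k([1, 3, 5, 7, 10, 20], toy_acc.get)
print(best_k)  # 20
```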
Computing infrastructure: We perform our experiments on a GPU cluster managed by SLURM. The case retriever was trained and evaluated on an NVIDIA GeForce RTX 2080 Ti GPU. The models for the Reuse step were trained and evaluated on NVIDIA GeForce RTX 8000 GPUs. Revise runs on an NVIDIA GeForce RTX 2080 Ti GPU when using ROBERTA for alignment and runs only on CPU when using TRANSE. We report validation set scores in Table 12.

B.1 Performance of Retriever
We compare the performance of our trained retriever with a ROBERTA-base model. We found that the ROBERTA model, even without any fine-tuning, performs well at retrieval. However, fine-tuning ROBERTA with our distant supervision objective improved the overall recall, e.g., from 86.6% to 90.4% on WEBQUESTIONSSP and from 94.8% to 98.4% on CFQ.
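The retrieval itself is a nearest-neighbor search over question embeddings. The following is a minimal sketch assuming question vectors (e.g., from RoBERTa) have already been computed; the toy 2-d vectors below stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_cases(query_vec, case_vecs, k):
    """Return indices of the k stored cases most similar to the query."""
    ranked = sorted(range(len(case_vecs)),
                    key=lambda i: cosine(query_vec, case_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: case 0 points almost the same way as the query.
query = [1.0, 0.1]
cases = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_cases(query, cases, k=2))  # [0, 1]
```

In practice, this search would run over an index of all training questions rather than a Python list.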

B.2 Performance on Unseen Entities
In Table 7 we showed that CBR-KBQA is effective for unseen relations. But what about unseen entities in the test set? On analysis, we found that on WebQSP, CBR-KBQA correctly copies unseen entities from the question 86.8% of the time (539/621). This is a +1.9% improvement over the baseline transformer model, which copies correctly 84.9% of the time (527/621). Note that unseen entities can be copied from the input NL query, so we do not need additional cases to be injected into the KNN index.
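As a quick arithmetic check, the reported percentages follow directly from the raw counts:

```python
# Counts from the analysis: correct copies / total unseen entities.
cbr_correct, baseline_correct, total = 539, 527, 621

cbr_acc = cbr_correct / total
baseline_acc = baseline_correct / total

print(round(100 * cbr_acc, 1))                    # 86.8
print(round(100 * baseline_acc, 1))               # 84.9
print(round(100 * (cbr_acc - baseline_acc), 1))   # 1.9
```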

B.3 Analysis of the Revise Step
In the revise step, we attempt to fix programs predicted by our reuse step that did not execute on the knowledge base. The predicted program can be syntactically incorrect or enforce conditions that lead to an unsatisfiable query. In our work, we focus on predicted programs that can be fixed by aligning clauses to relations in the local neighborhood of the query entities. We give examples of successful alignments in Table 16 as well as failed attempts at alignment in Table 17.
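The alignment step can be sketched as follows: replace a non-executing relation with the most similar relation that actually exists in the KB neighborhood of the query entities, scored by relation-embedding similarity (e.g., TransE or RoBERTa). The relation names and 2-d embeddings below are purely illustrative, not the paper's.

```python
# Toy relation embeddings; in practice these would come from
# TransE or an encoder over relation names.
emb = {
    "people.person.sibling": [1.0, 0.0],
    "people.sibling_relationship.sibling": [0.9, 0.2],
    "people.person.spouse": [0.0, 1.0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align_relation(failed_rel, neighborhood_rels):
    """Swap a predicted relation that does not execute for the most
    similar relation present in the entities' local KB neighborhood."""
    return max(neighborhood_rels,
               key=lambda r: dot(emb[failed_rel], emb[r]))

# The predicted relation is absent from the KB, but a close variant exists.
print(align_relation("people.person.sibling",
                     ["people.sibling_relationship.sibling",
                      "people.person.spouse"]))
# → people.sibling_relationship.sibling
```

Alignment fails (as in Table 17) when no relation in the neighborhood is semantically close to the intended one.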

C Details on Held-Out Experiments
In this section, we include more details about our held-out experiment described in section 3.4.1. The goal of this experiment is to show that our approach can generalize to unseen relations without requiring any further training of the model. This is a relevant setting to explore, because real-world knowledge bases are often updated with new kinds of relations, and we would like KBQA systems that adapt to handle new information with minimal effort.
We explicitly hold out all questions containing a particular relation from the datasets. Table 13 shows the relation type and the number of questions that are removed as a result of removing the relation.
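The held-out split can be sketched as a simple filter over the logical forms; the field names here are illustrative.

```python
def hold_out(dataset, relation):
    """Split the dataset: questions whose logical form mentions the
    held-out relation go to the evaluation side; the rest stay in train."""
    kept, held = [], []
    for ex in dataset:
        (held if relation in ex["relations"] else kept).append(ex)
    return kept, held

data = [
    {"q": "who directed Up?", "relations": ["film.film.directed_by"]},
    {"q": "who composed the score for Up?", "relations": ["film.film.music"]},
]
train, held_out_qs = hold_out(data, "film.film.music")
print(len(train), len(held_out_qs))  # 1 1
```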

D Details on Automated Case Collection and Human-in-the-Loop Experiments
While conducting analysis, we also noticed that WebQSP has queries in the test set for which the required relations are never present in the training set. This gives us an opportunity to conduct real human-in-the-loop experiments to demonstrate the advantage of our model. To add more cases, we resort to a mix of automated data collection and a human-in-the-loop strategy. For each of the missing relations, we first try to find NL queries present in SimpleQuestions. SimpleQuestions (SQ) is a large dataset containing more than 100K NL questions that are 'simple' in nature, i.e., each NL query maps to a single relation (fact) in the Freebase KB. For each missing relation type, we try to find questions in the SQ dataset that can be mapped to the missing relation. However, even SQ has missing coverage, in which case we manually generate a question and its corresponding SPARQL query by reading the description of the relation. Table 14 shows the number of questions in the evaluation set which have at least one relation never seen during training, along with the number of cases that have been added. For example, for WebQSP we were able to collect 292 questions from SQ and manually created 72 questions. Overall, we add 3.87 new cases per query relation for WebQSP. Table 15 shows some examples of cases added manually or from SQ. We look up entity ids for entities from the FACC1 alias table (§3.1). Also note that since we only add questions which are simple in nature, the corresponding SPARQL query can be easily constructed from the missing relation type and the entity id.
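Since each added case is a single-relation question, its SPARQL query can be instantiated from a one-triple template given the Freebase entity MID and the relation. The exact prefix and variable names below are illustrative, not necessarily those used in the paper.

```python
def simple_sparql(entity_mid, relation):
    """Build the one-triple SPARQL query for a 'simple' question:
    one known entity, one relation, one unknown answer variable."""
    return (
        "PREFIX ns: <http://rdf.freebase.com/ns/>\n"
        "SELECT DISTINCT ?x WHERE {\n"
        f"  ns:{entity_mid} ns:{relation} ?x .\n"
        "}"
    )

# Hypothetical example: MID and relation chosen for illustration only.
print(simple_sparql("m.0d6lp", "location.location.containedby"))
```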
Importance of this result: Through this experiment, we demonstrate two important properties of our model: interpretability and controllability. Database schemas keep changing and new tables keep getting added to a corporate database. When our QA system gets a query wrong, by looking at the retrieved K-nearest neighbors, users can determine the source of the error and correct the model's behavior by adding new cases to the memory.

A recent work (2020) proposed a meta-learning approach which utilizes cases retrieved w.r.t. the similarity of the input. However, their main goal is to learn a better parametric model (retriever and generator) from neighboring cases rather than composing and fixing cases to generate answers at test time.
Question Decomposition: One strategy for answering a complex question is to first break it down into simpler subquestions, each of which can be viewed as a natural language program describing how to answer the question. This approach has been shown to be effective in systems as far back as IBM Watson (Ferrucci et al., 2010) and in more recent systems for answering questions about text (Das et al., 2019; Min et al., 2019; Perez et al., 2020; Wolfson et al., 2020) or knowledge bases (Talmor and Berant, 2018). These prior studies do not leverage case-based reasoning when generating decompositions and thus may also benefit from techniques similar to those proposed in our work.