Do I have the Knowledge to Answer? Investigating Answerability of Knowledge Base Questions

When answering natural language questions over knowledge bases, missing facts, incomplete schema and limited scope naturally lead to many questions being unanswerable. While answerability has been explored in other QA settings, it has not been studied for QA over knowledge bases (KBQA). We create GrailQAbility, a new benchmark KBQA dataset with unanswerability, by first identifying various forms of KB incompleteness that make questions unanswerable, and then systematically adapting GrailQA (a popular KBQA dataset with only answerable questions). Experimenting with two state-of-the-art KBQA models, we find that both suffer a drop in performance even after suitable adaptation for unanswerable questions. In addition, these models often detect unanswerability for the wrong reasons and find specific forms of unanswerability particularly difficult to handle. This underscores the need for further research in making KBQA systems robust to unanswerability.


Introduction
The problem of natural language question answering over knowledge bases (KBQA) has received a lot of interest in recent years (Saxena et al., 2020; Zhang et al., 2022; Mitra et al., 2022; Wang et al., 2022; Cao et al., 2022b; Ye et al., 2022; Chen et al., 2021; Das et al., 2021). An important yet neglected aspect of this task for real-world deployment is the unanswerability of the posed questions. This problem arises naturally since KBs are almost always incomplete in different ways (Min et al., 2013), and users of such question answering systems are normally unaware of this incompleteness when posing queries. As a result, a significant fraction of questions posed to real-world KBQA systems may be unanswerable. In such cases, it is important for a model to detect and report that a question is unanswerable instead of generating an incorrect answer.
We first identify the different forms of KB incompleteness that lead to unanswerability. Using GrailQA (Gu et al., 2021), which is the largest and most diverse KBQA benchmark dataset, we then create a new benchmark for KBQA with unanswerable questions. This involves addressing a host of challenges, because of dependencies between the different forms of KB incompleteness, and because the answerability of different questions is coupled through the shared KB. We address these challenges systematically to create this new benchmark, which we call GrailQAbility. It also includes different forms of test scenarios for unanswerable questions, such as IID and zero-shot, mirroring those for answerable questions. Our approach is generic and can be used to similarly enrich other KBQA benchmarks.
We then use GrailQAbility to evaluate the robustness of two recent state-of-the-art KBQA models, RnG-KBQA (Ye et al., 2022) and ReTraCk (Chen et al., 2021), for detecting answerability while answering KB questions. Overall, we find that both models struggle to detect answerability even when retrained with unanswerable questions, and suffer from a significant drop in performance. Our analysis shows interesting differences in performance between the two models across different subsets of unanswerable questions, and thus serves as a first step towards developing new KBQA models that are robust to this very real challenge.

Figure 1: (A) An answerable question is shown for an example KB, with logical form l in s-expression, answer a and path (shaded blue). (B) 5 types of unanswerable questions for the provided KB, with actual logical forms l, answers a and ideal logical forms l* with missing KB elements in red (3-5). (C) Illustration of 3 different types of unanswerability scenarios in test.
In summary, our contributions are as follows: (a) We motivate and introduce the task of detecting answerability for KBQA. (b) We create GrailQAbility, which is the first benchmark for KBQA incorporating different types of, and test scenarios for, unanswerable questions. Our approach is generic and can be used to similarly enrich other KBQA benchmarks. (c) Experiments on GrailQAbility show that two state-of-the-art KBQA models struggle to detect answerability. Comparative analysis between the two models reveals useful insights towards the development of better KBQA models that are robust against unanswerability.

KBQA with Answerability Detection
A Knowledge Base (KB) (also called Knowledge Graph) G contains a schema S (or ontology) with entity types (or types) T and relations R defined over pairs of types. The types in T are often organized as a hierarchy. It also contains entities E as instances of types, and facts (or triples) F ⊆ E × R × E. (A KB also contains literals, which we ignore for brevity of notation.) The top layer of Fig. 1(A) shows example schema elements, while the bottom layer shows entities and facts.
In Knowledge Base Question Answering (KBQA), we are given a question q written in natural language, which is translated to a logical form (or query) l that executes over G to yield an answer a. Since a question may have many answers, a is a set. In this paper, we consider logical forms in the form of s-expressions (Gu et al., 2021), which can be easily translated to KB query languages such as SPARQL and provide a balance between readability and compactness (Gu et al., 2021). We call an s-expression valid for G if it executes over G without an error. When an s-expression executes successfully, it traces a path over facts in the KB leading to the answer. Fig. 1(A) shows an example question with a valid s-expression and the path traced by its execution. When a question has multiple answers, it has a corresponding path for each answer.
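To make the execution semantics concrete, here is a toy sketch (ours, not from the paper): a KB represented as a set of triples and a minimal one-hop executor. Each matching triple is one "answer path", and the answer is the set of path endpoints. All entity and relation names here are invented for illustration.

```python
def execute_one_hop(kb, relation, obj):
    """Return all subjects s such that (s, relation, obj) is a fact in kb.

    Each matching triple is one answer path; the answer is the set of
    endpoints, so a question can have several answers via distinct paths.
    """
    return {s for (s, r, o) in kb if r == relation and o == obj}

# A toy KB as a set of (subject, relation, object) facts (hypothetical data).
kb = {
    ("einstein", "field_of_study", "physics"),
    ("curie", "field_of_study", "physics"),
    ("curie", "field_of_study", "chemistry"),
}

# "Who studied physics?" -> two answers, each traced by its own path.
answers = execute_one_hop(kb, "field_of_study", "physics")
```

A real s-expression executor would of course handle joins, argmax, counting and type constraints; this only illustrates the path-per-answer view used in the definitions that follow.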
We define a question q to be answerable for a KB G if (a) q admits a valid logical form l for G, AND (b) l returns a non-empty answer a when executed over G. Note that this definition does not address correctness or completeness of the facts in G and therefore of the answer. The standard KBQA task over a KB G is to output the answer a, and optionally the logical form l, given a question q, assuming q to be answerable for G. Most recent KBQA models (Ye et al., 2022; Chen et al., 2021) are trained with questions, answers and logical forms. Other models directly generate the answer (Sun et al., 2019; Saxena et al., 2022). Different train-test settings - iid, zero-shot and compositional generalization - have been explored and are included in benchmark KBQA datasets (Gu et al., 2021).
By negating the above definition, we define a question q to be unanswerable for a KB G if (a) q does not admit a valid logical form l for G, OR (b) the valid l returns an empty answer when executed over G. For a meaningful query, unanswerability arises due to different forms of deficiency in the available KB G. We assume that there exists an 'ideal KB' G* for which q admits a valid ideal logical form l* which executes on G* to generate a non-empty ideal answer a*. The available KB G lacks some schema element or fact relative to G*, which makes q unanswerable. Fig. 1(A) illustrates an ideal KB and a given KB with the missing elements shown in red. In Fig. 1(B), questions 2 and 3 yield valid queries over the available KB but are missing facts needed to return an answer, while questions 4-6 lack schema elements for valid queries. The task of KBQA with answerability detection, given a question q and an available KB G, is to (a) appropriately label the answer a as NA (No Answer) or the logical form l as NK (No Knowledge) when q is unanswerable for G, OR (b) generate the correct non-empty answer a and valid logical form l when q is answerable for G. (For models that directly generate the answer, l may be left out.) The training set may now additionally include unanswerable questions labeled appropriately with a = NA or l = NK. Note that training instances contain only valid logical forms when they exist for the available KB (for answerable questions and for unanswerable questions with valid logical forms), but not the 'ideal' logical forms for unanswerable questions.
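The two unanswerability cases (a) and (b) can be sketched as follows. This is our own illustrative code, not the paper's: `parse` and `execute` are hypothetical stand-ins for a semantic parser that may fail to produce a valid logical form for the available KB, and for a query executor.

```python
NA, NK = "NA", "NK"  # no-answer / no-knowledge labels from the task definition

def kbqa_with_answerability(parse, execute, question):
    """parse returns a logical form, or None when no valid LF exists for G;
    execute returns the (possibly empty) answer set of a valid LF."""
    lf = parse(question)
    if lf is None:            # case (a): no valid logical form -> label l as NK
        return NA, NK
    answer = execute(lf)
    if not answer:            # case (b): valid LF but empty answer -> label a as NA
        return NA, lf         # the valid LF is kept; only the answer is NA
    return answer, lf

# Stub parser/executor for illustration (hypothetical question ids and LF).
parse = lambda q: None if q == "q4" else "(JOIN r e)"
execute = lambda lf: {"ans1"}
```

Note that in case (b) the valid logical form is retained as the target, matching the statement above that training instances keep valid logical forms whenever they exist.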
In addition to the different test scenarios for answerable questions, different test scenarios exist for unanswerable questions as well in this new task. Fig. 1(C) illustrates the three scenarios for unanswerable questions, which we call iid, full zero-shot and partial zero-shot. IID unanswerable questions in the test set follow the same distribution over unanswerable questions as in train. In contrast, zero-shot unanswerable questions are generated from a different distribution over unanswerable questions than that in train. These are expected to form a large fraction of unanswerable questions in test sets from real KB applications. The ideal logical form for these questions contains at least one 'unseen missing' schema element - one that does not appear in the ideal logical form of any unanswerable question in train. If all of the schema elements in the ideal logical form are 'unseen missing', we term it full zero-shot. Otherwise, if the ideal logical form contains both seen KB elements from answerable train questions and unseen missing KB elements, we term it partial zero-shot. Note that KBQA datasets do not make this distinction for answerable questions. For unanswerable questions, our experiments show that these scenarios are not equally difficult. (Note that we have skipped the notion of composability for unanswerable questions. We discuss this further in Sec. 5.)
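The three scenarios can be summarized as a small classification rule. A minimal sketch (ours, with hypothetical relation names), where each question is represented by the set of schema elements in its ideal logical form:

```python
def unanswerability_scenario(ideal_lf_elements, unseen_missing):
    """Classify an unanswerable test question.

    ideal_lf_elements: schema elements in the question's ideal logical form.
    unseen_missing: missing schema elements that never appear in the ideal
    LF of any unanswerable training question.
    """
    if not ideal_lf_elements & unseen_missing:
        return "iid"                       # no unseen missing element involved
    if ideal_lf_elements <= unseen_missing:
        return "full zero-shot"            # every element is unseen missing
    return "partial zero-shot"             # mix of seen and unseen elements
```

This is only the classification side; the construction side (how the splits are actually produced) is described in the next section.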

GrailQAbility: Extending GrailQA with Answerability Detection
In this section, we describe the creation of a new benchmark dataset for KBQA with unanswerable questions.
We assume the given KB to be the ideal KB G* and the given logical forms and answers to be the ideal logical forms L* and ideal answers A* for the questions Q. We then create a KBQA dataset Q_au with answerable and unanswerable questions over an 'incomplete' KB G_au by iteratively dropping KB elements from G*. Prior work on QA over incomplete KBs has explored algorithms for dropping facts from KBs (Saxena et al., 2020; Thai et al., 2022). We extend this to all types of KB elements (type, relation, entity and fact) and explicitly track and control unanswerability. At step t, we sample a KB element g from the current KB G_au^{t-1}, identify all questions q in Q_au^{t-1} whose current logical form l^{t-1} or path p^{t-1} contains g, and remove g. Since q may have multiple answer paths, this may eliminate only some answers from a^{t-1} without making it empty, in which case q remains answerable. If q becomes unanswerable, we mark it appropriately (with a^t = NA or l^t = NK) and update G_au^t = G_au^{t-1} \ {g}. This process continues until Q_au^t contains a desired percentage p_u of unanswerable questions.
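A simplified Python sketch of this drop loop, restricted to fact drops for brevity (all names and data are ours, not the paper's). Each question carries its answer paths as sets of facts; a question becomes unanswerable (a = NA) once every path has lost a fact:

```python
import random

def drop_facts_until(questions, facts, target_frac, rng=None):
    """questions: qid -> list of answer paths, each path a set of facts.
    Drop facts one at a time (uniformly here; the paper also weights by
    popularity and importance) until at least target_frac of the questions
    have lost all answer paths."""
    rng = rng or random.Random(0)
    facts = set(facts)
    unanswerable = set()
    while len(unanswerable) < target_frac * len(questions) and facts:
        g = rng.choice(sorted(facts))       # sample a KB element to drop
        facts.discard(g)
        for qid, paths in questions.items():
            paths[:] = [p for p in paths if g not in p]  # broken paths die
            if not paths:
                unanswerable.add(qid)       # all paths gone: mark a = NA
    return facts, unanswerable
```

Note how answerability is coupled through the shared KB: dropping a single fact can push several questions over the edge at once, which is exactly why the percentage can only be approximated, as noted below for the split algorithm.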
One of the important details is sampling the KB element g to drop. In an iterative KB creation or population process, whether manual or automated, popular KB elements are less likely to be missing at any time. Therefore we sample g according to inverse popularity in G*. However, this naive sampling process is inefficient, since it is likely to affect the same questions across iterations or not affect any question at all. So the sampling additionally considers the importance of g for Q_au^t - the number of questions in Q_au^t whose current logical form or answer contains g.
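One simple way to combine the two criteria is to weight each candidate by importance divided by popularity, then sample proportionally. This is our own reading of the description, sketched with hypothetical element names; the paper does not specify the exact weighting function:

```python
import random

def sample_drop_candidate(elements, popularity, importance, rng):
    """Weight each element g by importance(g) / popularity(g): elements that
    are rare in G* but still appear in many current questions are preferred,
    so each drop keeps affecting fresh questions."""
    weights = [importance[g] / popularity[g] for g in elements]
    return rng.choices(elements, weights=weights, k=1)[0]
```

With this weighting, an element that no longer appears in any current logical form or answer gets weight zero and is never sampled, which addresses the inefficiency noted above.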
Once a dataset Q_au is created with p_u unanswerable questions, it is split into train Q_au^Tr and test Q_au^T, such that Q_au^T contains percentages p_u^i, p_u^fz and p_u^pz of iid, full zero-shot and partial zero-shot questions, with p_u = p_u^i + p_u^fz + p_u^pz. We first select and set aside the partial and full zero-shot questions in Q_au^T, and then split the remaining questions in Q_au into Q_au^Tr and the iid questions in Q_au^T. We select the zero-shot questions by sampling dropped KB elements g_d from G_d = G* \ G_au. All questions q ∈ Q_au that contain g_d are added to either the partial or the full zero-shot questions in Q_au^T. We stop when the desired percentages p_u^fz and p_u^pz are reached. Any questions in Q_au that still contain these zero-shot KB elements are removed, to ensure that they do not get added to training. The remaining questions in Q_au are split into Q_au^Tr and n_u^i (= p_u^i × |Q_au|) iid questions for Q_au^T. We finally split Q_au^T into test and development sets. Note that this algorithm only ensures that the actual percentages are close to the desired ones.
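The core of the zero-shot reservation step can be sketched as follows (our simplification: the actual procedure stops when target percentages are reached and removes the excess questions rather than moving them all to test):

```python
def reserve_zero_shot(questions, zero_shot_elements):
    """questions: qid -> set of dropped KB elements in its ideal logical form.
    Questions touching a zero-shot element go to the zero-shot test pool;
    the rest remain available for the train / iid-test split. This guarantees
    zero-shot elements never leak into training."""
    zero_shot, remaining = {}, {}
    for qid, dropped in questions.items():
        target = zero_shot if dropped & zero_shot_elements else remaining
        target[qid] = dropped
    return zero_shot, remaining
```

The leak-prevention step is the important part: a zero-shot element must not appear in the ideal logical form of any training question, or the "unseen missing" property defined earlier would be violated.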
Next, we describe the specifics for individual types of KB elements, and then how we drop all types of KB elements in the same dataset.
Fact Drop: Dropping a fact f from G_au^t does not affect the logical form of any question q ∈ Q_au^t. However, it can eliminate an answer path p^t for q. Since q can have multiple answer paths, it becomes unanswerable only when all its answer paths are eliminated. If q becomes unanswerable, we set its answer a^t to NA in Q_au^t but leave its logical form unchanged. For selecting facts to drop, we consider all facts to be equally popular.
Entity Drop: To drop an entity e from G_au^t, we first drop all facts f associated with e, and then drop e itself. Dropping facts affects answerability of questions as above. Dropping e additionally affects answerability of questions q that involve e as one of the mentioned entities. For such q, the logical form l^t also becomes invalid, so we set l^t = NK in Q_au^t. As with facts, we consider all entities to be equally popular.
Relation Drop: To drop a relation r from G_au^t, we first drop all facts f associated with r, and then drop r from the schema of G_au^t. Dropping facts makes some questions q unanswerable as above, and we set a^t = NA in Q_au^t. Dropping r additionally invalidates the logical form of any question q whose logical form contains r, and we set l^t = NK in Q_au^t. The popularity of relation r is taken as the number of facts associated with r in G*.
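A relation drop is the first case that can produce both labels at once, so it makes a useful sketch. Ours, with a hypothetical question representation (each question carries its answer paths as lists of facts, plus the relations appearing in its logical form):

```python
def drop_relation(facts, questions, r):
    """facts: set of (subject, relation, object) triples.
    questions: qid -> dict with 'paths' (answer paths as lists of facts),
    'lf_relations', 'answer' and 'lf'. Returns the surviving facts and the
    per-question (answer, logical form) after the drop."""
    surviving = {f for f in facts if f[1] != r}
    updated = {}
    for qid, q in questions.items():
        paths = [p for p in q["paths"] if all(f in surviving for f in p)]
        a = q["answer"] if paths else "NA"          # all paths gone: a = NA
        lf = "NK" if r in q["lf_relations"] else q["lf"]  # LF uses r: l = NK
        updated[qid] = (a, lf)
    return surviving, updated
```

A question whose logical form uses r always loses its facts too, so relation drops typically set both a = NA and l = NK, whereas fact drops alone only ever set a = NA.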

Entity Type Drop: Dropping an entity type is the most complex operation. Entities are often tagged with multiple types in a hierarchy (e.g., Researcher and Person). To drop an entity type t ∈ T from G_au^t, we drop all entities e that are associated only with t, and also all relations r associated with t. Entities that are still associated with remaining types in G_au^t are preserved. For an affected question q, we set a^t = NA and l^t = NK in Q_au^t. The popularity of an entity type t is defined as the number of facts in G* associated with t or any of its descendant types.
Combining Drops: Our final objective is a dataset Q_au that contains a percentage p_u of unanswerable questions, with contributions p_u^f, p_u^e, p_u^r and p_u^t from the four types of incompleteness. Starting with the original questions Q* and KB G*, we execute type drop, relation drop, entity drop and fact drop with the corresponding percentages in sequence, in each step operating on the updated dataset and KB. For analysis, we label questions with the drop type that caused unanswerability. Note that a question may be affected by multiple types of drops at the same time.

GrailQAbility Dataset: We run the combined drop algorithm on GrailQA (Gu et al., 2021), the largest and most diverse KBQA benchmark, which is based on Freebase but contains only answerable questions, to create a new benchmark for KBQA with answerability detection. We call this GrailQAbility (GrailQA with Answerability). We will make this dataset public. Our dataset creation approach is generic and can be extended to other KBQA benchmarks.
Since the test questions for GrailQA are unavailable, we use its train and dev sets as the original set of questions Q*. Aligning with earlier QA datasets with unanswerability, we keep the total percentage of unanswerable questions p_u at 33%, splitting it equally (8.25%) between dropped entity types, relations, entities and facts. We split the questions into train, test and validation as 70%, 20% and 10%, such that each contains 33% unanswerable questions. The unanswerable questions in test and dev contain 50% iid and 50% zero-shot questions, with the latter further split into 75% partial zero-shot and 25% full zero-shot. Statistics for GrailQAbility and GrailQA are compared in Tab. 1. Sizes of the different splits are shown in Tab. 2.

Experimental Setup
We now discuss the KBQA models that we choose for experimentation and how we adapt these for detecting answerability.
KBQA Models: Among state-of-the-art KBQA models, we pick RnG-KBQA (Ye et al., 2022) and ReTraCk (Chen et al., 2021). These report state-of-the-art results on GrailQA as well as on WebQSP (Berant et al., 2013; Yih et al., 2016; Talmor and Berant, 2018) - the two main benchmarks. They are also among the top published models on the GrailQA leaderboard (https://dki-lab.github.io/GrailQA/). Being based on logical form generation, we expect these to be better able to detect fact- and entity-level incompleteness than purely retrieval-based approaches (Saxena et al., 2020; Das et al., 2021; Zhang et al., 2022; Mitra et al., 2022; Wang et al., 2022). Importantly, the two also differ in architecture in ways that are relevant for answerability. We expect the sketch-generation-based approach (Cao et al., 2022b) to also have some ability to detect answerability, but unfortunately it does not yet have working code available.
RnG-KBQA (Ye et al., 2022) follows a rank and generate approach, where a BERT-based (Devlin et al., 2019) ranker selects a set of candidate logical forms for a question by searching the KB, and then a T5-based (Raffel et al., 2020) model generates the logical form using the question and candidates.
ReTraCk (Chen et al., 2021) also uses a rank-and-generate approach, but uses a dense retriever to retrieve schema elements for a question, and grammar-guided decoding with an LSTM (Hochreiter and Schmidhuber, 1997) to generate the logical form from the question and retrieved schema items. Though both models check execution of generated logical forms, only ReTraCk returns an empty logical form when execution fails, which we interpret as an NK prediction.
For both models, we use the existing code from GitHub and augment it with thresholding for entity disambiguation and logical form generation.
Adapting for Answerability: We adapt these models for answerability in two different ways. In the simplest setting (A training), we train the models in their original setting with only the answerable subset of training questions, leaving out the unanswerable ones. We also train using both the answerable and unanswerable questions in the training data (A+U training). Additionally, we use thresholding on output probabilities to generate l = NK predictions. Specifically, we introduce two different thresholds, for entity disambiguation and logical form generation. In addition to explicit NK predictions, we consider the prediction to be NK when the MAP probabilities for entity disambiguation and logical form generation are less than their corresponding thresholds.
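The thresholding rule is simple enough to state as code. A minimal sketch (ours; variable names are hypothetical, and the thresholds are tuned on the dev set as described in the evaluation):

```python
def thresholded_prediction(lf, p_entity, p_lf, tau_entity, tau_lf):
    """Return NK when the model predicts it explicitly, or when either MAP
    probability (entity disambiguation or logical form generation) falls
    below its threshold; otherwise keep the generated logical form."""
    if lf == "NK" or p_entity < tau_entity or p_lf < tau_lf:
        return "NK"
    return lf
```

Low confidence at either stage suffices to abstain, so the two thresholds act as an OR condition over the pipeline's weakest link.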

Evaluation Measures
We evaluate both the answers and the logical form. Since a question may have multiple answers, we evaluate predicted answers using precision, recall and F1 (Ye et al., 2022). In regular answer evaluation (R), we compare the predicted answer for a question q only with its possibly modified gold answer a_au in the unanswerability dataset Q_au. However, some models may still be able to predict the 'ideal' answer a for q in the original dataset Q, perhaps by inferring missing facts. To give credit to such models, we propose a lenient answer evaluation (L), where we compare the predicted answer against both a_au and a. For evaluating logical forms, as usual, we compare the predicted logical form with the gold-standard one in Q_au using exact match (EM) (Ye et al., 2022).

Table 3: Performance of different models on GrailQAbility over all, answerable and unanswerable questions. EM is exact match on logical forms, and F1(L) and F1(R) are lenient and regular evaluations of answers. A and A+U indicate training with only answerable questions and with both answerable and unanswerable questions. Models with suffix +T have additional thresholds for entity linking and LF generation fine-tuned on the GrailQAbility dev set.
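The regular and lenient answer evaluations above can be sketched as set-based metrics, with NA modeled as the empty answer set (our own illustrative code):

```python
def prf1(pred, gold):
    """Set-based precision, recall and F1 over answer entities.
    NA is modeled as the empty set, so two NA predictions match perfectly."""
    if not pred and not gold:
        return 1.0, 1.0, 1.0
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def lenient_f1(pred, gold_au, gold_ideal):
    """Lenient evaluation (L): credit the better match against the modified
    gold answer a_au or the ideal answer a from the original dataset."""
    return max(prf1(pred, gold_au)[2], prf1(pred, gold_ideal)[2])
```

By construction, F1(L) is never lower than F1(R), so the gap between the two directly measures how often a model recovers the ideal answer despite KB incompleteness.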

Results and Discussion
Here, we present high-level experimental results on GrailQAbility. Tab. 3 shows the high-level performance of RnG-KBQA and ReTraCk on answerable and unanswerable questions. We first observe that the base models trained only with answerable questions (A training) perform as expected on answerable questions (A) but quite poorly on unanswerable questions (U). Performance improves with thresholding, further with A+U training, and further still with thresholding after A+U training. However, the highest EM performance for U (64.5%) is significantly lower than the best A performance (83.4%). Answer quality in terms of F1(R) is higher by at least 22 pct points, implying that the models often correctly predict NA (no answer) but for incorrect reasons. Answers are better according to the lenient evaluation F1(L) by at least 5 pct points. This indicates that the models are often able to recover the ideal answer in spite of incompleteness. Between the two models, RnG-KBQA outperforms ReTraCk in terms of logical form (EM) with thresholding but not otherwise. In terms of answers, however, ReTraCk is always better, meaning that it is more susceptible to spurious reasoning. The one case where thresholding hurts is RnG-KBQA on answerable questions, with both A and A+U training.
Discussion: We next aim to explain the results in terms of the architectures of the two models. The candidate path generator for RnG-KBQA and the constrained decoder for ReTraCk rely on data-level paths, and are therefore particularly vulnerable to data-level incompleteness. It would also be useful to experiment with sketch-based models (Cao et al., 2022b), which fill in KB-specific elements only after generating a canonical logical form at a meta-level. These have so far been shown to be useful for transferability in KBQA, and should be more robust against both data-level and schema-level incompleteness.
Note that we have not considered compositional unanswerable questions. We may define a compositional unanswerable question as one whose logical form contains more than one missing KB element, all of which have appeared in unanswerable training instances, but never in the same instance. Thus the composition operation needed for detecting unanswerability is a simple OR condition, and this should not be any more challenging than detecting IID unanswerability.
We have considered different ways of adapting KBQA models for unanswerability. We briefly touch upon the practicality of these settings. Though retraining with unanswerable questions followed by thresholding shows the best performance in our experiments, we had the advantage of knowing the ideal KB when creating the dataset. Creating such a dataset in the absence of an ideal KB presents many challenges. In view of this, the thresholded models with A+U training assume bigger significance. Few-shot answerability training is also a practical direction to explore.
Related Work

KBQA Models: There has been extensive research on KBQA in recent years. Retrieval-based approaches (Saxena et al., 2020; Zhang et al., 2022; Mitra et al., 2022; Wang et al., 2022) learn to identify paths in the KB starting from entities mentioned in the question, and then score and analyze these paths to directly retrieve the answer. Query generation approaches (Cao et al., 2022b; Ye et al., 2022; Chen et al., 2021; Das et al., 2021) learn to generate a logical form or a query (e.g., in SPARQL) based on the question, which is then executed over the KB to obtain the answer. Some of these retrieve KB elements first and then use these in addition to the question to generate the logical form (Ye et al., 2022; Chen et al., 2021). Cao et al. (2022b) first generate a KB-independent program sketch and then fill in specific arguments by analyzing the KB. None of these address answerability, and all have so far been evaluated only on answerable questions. Work on QA over incomplete KBs focuses on improving the accuracy of QA, using case-based reasoning and missing fact prediction, without addressing answerability (Thai et al., 2022; Saxena et al., 2020). Moreover, it focuses only on missing facts, while we investigate other forms of incompleteness affecting answerability.
Answerability in QA: Answerability has been explored for extractive QA (Rajpurkar et al., 2018), conversational QA (Choi et al., 2018; Reddy et al., 2019), boolean (Y/N) QA (Sulem et al., 2022) and MCQ (Raina and Gales, 2022). While our work is motivated by these, the nature of unanswerable questions is very different for KBs compared to unstructured contexts.

QA Datasets and Answerability: Many benchmark datasets exist for KBQA (Gu et al., 2021; Yih et al., 2016; Talmor and Berant, 2018; Cao et al., 2022a), but they contain only answerable questions. Unanswerable questions have been incorporated into other QA datasets (Rajpurkar et al., 2018; Sulem et al., 2022; Reddy et al., 2019; Choi et al., 2018; Raina and Gales, 2022), typically by pairing one question with the context for another question. Incorporating unanswerable questions in KBQA involves modifying the shared KB with interdependent operations, and is thus significantly more challenging.

Conclusions
In summary, we have introduced and motivated the task of detecting answerability when answering questions over a KB. We have created GrailQAbility as the first benchmark dataset for KBQA with unanswerable questions. Using experiments on GrailQAbility, we show that state-of-the-art KBQA models are not good at detecting answerability in general. We believe that our work and dataset will inspire and enable research towards developing more robust KBQA models.