The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models

Large language models (LLMs) have been shown to possess impressive capabilities, while also raising crucial concerns about the faithfulness of their responses. A primary issue arising in this context is the management of (un)answerable queries by LLMs, which often results in hallucinatory behavior due to overconfidence. In this paper, we explore the behavior of LLMs when presented with (un)answerable queries. We ask: do models represent the fact that the question is (un)answerable when generating a hallucinatory answer? Our results show strong indications that such models encode the answerability of an input query, with the representation of the first decoded token often being a strong indicator. These findings shed new light on the spatial organization within the latent representations of LLMs, unveiling previously unexplored facets of these models. Moreover, they pave the way for the development of improved decoding techniques with better adherence to factual generation, particularly in scenarios where query (un)answerability is a concern.


Introduction
Modern large language models (LLMs) have been tantalizing the NLP community in the last couple of years (Brown et al., 2020; Chen et al., 2021; Chung et al., 2022), demonstrating great potential for both research and commercial use, but these models are of course not problem-free. Among their unfavorable behaviors are toxicity (Welbl et al., 2021; Deshpande et al., 2023), bias (Nadeem et al., 2021; Abid et al., 2021), and hallucination (Manakul et al., 2023; Ji et al., 2023).
One of the settings in which LLMs are notoriously prone to hallucinate is when presented with (un)answerable questions (Sulem et al., 2021; Asai and Choi, 2021; Amayuelas et al., 2023). Recent works in this setting, which is the focus of this work, suggested using models' confidence as an indication of answerability (Yin et al., 2023), while others suggested further finetuning to enhance the probability of detecting (un)answerable questions (Jiang et al., 2021; Kadavath et al., 2022). We, however, ask whether models already represent questions' (un)answerability when producing answers, and find strong evidence for a positive answer. Our code is publicly available at https://github.com/lovodkin93/unanswerability
Specifically, by experimenting with three QA datasets (Rajpurkar et al., 2018; Kwiatkowski et al., 2019; Trivedi et al., 2022), we observe a substantial increase in performance for (un)answerable questions (up to 80%) simply by incorporating into the prompt the possibility of (un)answerability. We further show that, even in the absence of guidance in the prompt, the fact that the question is (un)answerable is decodable from the model's representations. This is done by two methods: first, we find that the beam of decoded responses for (un)answerable queries often contains a response recognizing their (un)answerability; second, we demonstrate that the fact that the question is (un)answerable is easily decodable from the model's representations and that there is a linear separation between representations of answerable and (un)answerable questions (see Figure 1). The existence of the answerability subspace is largely independent of the specific QA dataset used, in the sense that an answerability classifier trained over representations of questions from one dataset can successfully classify (un)answerable questions from other datasets as well. In addition to providing illuminating insights into the internal mechanics of LLMs, these findings also open up new avenues for better decoding methods (Meister et al., 2020; Wiher et al., 2022) to improve performance in general and on (un)answerable questions in particular.

Related Work
In previous research, (un)answerable questions were used to evaluate reasoning capabilities (Rajpurkar et al., 2018; Ferguson et al., 2020; Kwiatkowski et al., 2019; Trivedi et al., 2022). It was SQuAD v2 (Rajpurkar et al., 2018) that provided the first reading comprehension dataset for validating models' ability to deal with (un)answerability, by introducing questions that cannot be addressed from the given context. Kwiatkowski et al. (2019) followed the same line and included about a third of (un)answerable questions in their NATURAL QUESTIONS (NQ), an annotated open-domain QA dataset. Recently, MuSiQue (Trivedi et al., 2022) was introduced as a challenging multi-hop QA benchmark that consists of (un)answerable questions, in which supporting paragraphs have been intentionally removed from the context. Our experiments use these datasets to demonstrate the effectiveness of our approach to identifying (un)answerability.
(Un)answerability capabilities in LLMs were mainly studied using few-shot prompting (Kandpal et al., 2022; Weller et al., 2023). Moreover, several works have recently shown that LLMs become easier to steer with natural language prompts either as they become larger (Mishra et al., 2022a; Kandpal et al., 2022; Carlini et al., 2023) or as they are exposed to larger instruction tuning data (Mishra et al., 2022b; Chung et al., 2022; Wan et al., 2023a), which may in turn improve the (un)answerability capabilities of the model. Specifically, in this work, we utilize prompt manipulation in order to systematically reveal to the model the option of avoiding answering hard questions. Automatic prompt tuning can also be used for improving (un)answerability capabilities, without the need for manually handcrafting prompts. Liao et al. (2022) introduced a prompt tuning-based strategy to mitigate (un)answerable questions, by mapping questions into their proper, specific templates.
Other works tried to steer model predictions towards better (un)answerability using data augmentation (Zhu et al., 2019), and Asai and Choi (2021) provided an in-depth analysis of the ability to detect (un)answerability in LMs, focusing on the data fed to the model. Furthermore, recent studies have suggested utilizing recent advances in white-box model interpretability (Geva et al., 2022b; Li et al., 2022; Mallen et al., 2022; Mickus et al., 2022; Meng et al., 2023; Geva et al., 2023) and probing (Adi et al., 2017; Conneau et al., 2018; Voita et al., 2019; Slobodkin et al., 2021) for manipulating model predictions and analyzing when LLMs struggle to answer questions. Recent works also tried to use beam search decoding to manipulate the generated outputs, using the information encapsulated in several beams (Meister et al., 2020; Leblond et al., 2021; Slobodkin et al., 2023; Wan et al., 2023b). Finally, early exiting in language models (Schwartz et al., 2020; Schuster et al., 2022; Din et al., 2023) and model prediction calibration (Desai and Durrett, 2020; Jiang et al., 2021; Dhuliawala et al., 2022; Geva et al., 2022a) are strongly related to our work, as they suggest analyzing and improving model predictions and output distributions.

Method
We posit the hypothesis that, despite the inclination of LLMs to produce answers to (un)answerable queries, they do encode the (un)answerability of such queries within their latent representations. We examine this hypothesis by undertaking three distinct experimental approaches: (1) prompt manipulation, (2) beam scrutiny, and (3) probing (including identification and erasure of an answerability subspace).

Prompt Manipulation
First, we ask whether the model's ability to identify (un)answerable questions is sensitive to the exact wording of the prompt. Specifically, we ask whether merely raising the option of unanswerability makes the model less susceptible to hallucination. To that end, we experiment with two types of prompts. The first type is designed to merely guide the model towards addressing a question. The second type, however, is more instructive in its approach: besides guiding the model, it provides an advised course of action for scenarios where the question at hand is (un)answerable, hence indirectly hinting at the potential for (un)answerability.
Our experimental setup encompasses both zero-shot and few-shot prompts, with the latter involving the integration of two exemplars in the prompt. In the standard prompt setup, both exemplars are answerable. However, within the hinting prompt framework, one exemplar is designed to be (un)answerable. Figure 2 demonstrates all variants.
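As a concrete illustration, the two prompt types can be sketched as follows. The hint wording follows the first variant listed in Appendix C, but the exact field layout (`Passage:`/`Question:`/`Answer:`) is an assumption, not the paper's verbatim template.

```python
def build_prompt(passage: str, question: str, hint: bool = False) -> str:
    """Build a zero-shot QA prompt. With hint=True, the instructions
    advise the model what to reply when the question is unanswerable
    (wording from the first variant in Appendix C). The field layout
    below is illustrative only."""
    instruction = "Given the following passage and question, answer the question."
    if hint:
        instruction += ' If it cannot be answered based on the passage, reply "unanswerable".'
    return f"{instruction}\n\nPassage: {passage}\nQuestion: {question}\nAnswer:"
```

The few-shot variants would additionally prepend two exemplars, one of which is (un)answerable in the hinting setup.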

Beam Relaxation
Recall that the output of LMs is usually decoded with algorithms such as beam search. We aim to examine whether we can endow this algorithm with a bias towards unanswerability. Focusing on the zero-shot setting, we gradually increase the beam size. Then, instead of automatically choosing the highest-probability answer from the final set of k options, we search for a reply within those k options that signifies (un)answerability (Appendix A).
If such an answer is discovered, we substitute the top-beam answer with "unanswerable".
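The substitution step can be sketched as below; the phrase list is a subset of the responses curated in Appendix A, and the function name is illustrative rather than the paper's implementation.

```python
# A subset of the (un)answerability-recognizing responses from Appendix A,
# lowercased for matching; the full curated list is longer.
UNANSWERABLE_PHRASES = {
    "unanswerable", "n/a", "idk", "i don't know", "not known",
    "answer not in context", "unknown", "no answer", "it is unknown",
}

def relax_beam(beam_candidates):
    """Given the final k beam candidates (ranked by probability), return
    "unanswerable" if any candidate signals abstention; otherwise keep
    the top-ranked candidate."""
    for candidate in beam_candidates:
        if candidate.strip().lower() in UNANSWERABLE_PHRASES:
            return "unanswerable"
    return beam_candidates[0]
```

Note that the top answer is replaced even when the abstaining reply ranks last in the beam, which is precisely what biases decoding towards unanswerability.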

Identifying an Answerability Subspace
In a subsequent set of experiments, our objective is to find evidence for (un)answerability encoding directly in the embedding space of the models, by probing the models' last hidden layer. For each task, each model is prompted with a balanced trainset comprising 400 answerable and 400 (un)answerable examples. Then, for each instance, we take the embedding from the final hidden layer of the first generated token and train a linear classifier, using logistic regression, to predict answerability. Subsequently, we assess the performance of each classifier on the corresponding test set. As a baseline, we also conduct similar experiments using the initial (non-contextual) embedding layer, which should not encode whether the question is answerable or not. Our core objective within this experimental setup is to ascertain whether a basic linear classifier, trained on a modestly sized dataset, suffices to effectively discriminate between answerable and (un)answerable queries.
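A minimal sketch of this probing setup follows. Random vectors with an injected toy class signal stand in for the first-token last-layer embeddings, which in the real experiments come from the LM; the dimensionality and shift are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for the last-hidden-layer embeddings of the first generated
# token (800 training instances, as in the paper's balanced trainset);
# labels: 1 = unanswerable. The 0.5 mean shift is a synthetic signal.
d = 64
X_train = rng.normal(size=(800, d))
y_train = rng.integers(0, 2, size=800)
X_train = X_train + y_train[:, None] * 0.5

# The probe: a plain logistic-regression classifier over the embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = probe.score(X_train, y_train)
```

In the paper's setting the same probe is then evaluated on held-out instances, and, as a baseline, on the non-contextual first-layer embeddings.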

Erasing the Answerability Subspace
Upon identifying a linear subspace that corresponds to (un)answerability, a natural question to ask is whether that subspace has behavioral relevance, i.e., whether it is being used by the model when producing text. Importantly, this is different from mere encoding of the information, as the information can be present in the representation and at the same time be irrelevant to the model's behavior (Hewitt and Liang, 2019; Elazar et al., 2021; Ravfogel et al., 2021). Recent work on linear concept erasure (Ravfogel et al., 2020, 2022b; Belrose et al., 2023) has proposed a set of methods to erase arbitrary linearly-encoded concepts from neural representations. The intuition is that by erasing a subspace that encodes the concept and examining the effect on the model's output, we can verify that the subspace we identify is behaviorally meaningful, opening an avenue for performing interventions in that subspace in order to modify the model's behavior. These methods start with the original representations alongside binary labels (e.g., representations of the text alongside binary gender annotations for each text), and return new representations which are linearly guarded, in the sense that any linear classifier trying to recover the concept from them will fail.
While linear erasure has its limitations (Ravfogel et al., 2022a), it has been proven to be an effective method for intervening in the latent representations of black-box models.
We use the recently proposed method of Belrose et al. (2023), which provides a closed-form solution for the concept-erasure objective. Concretely, given a binary concept (answerability), the method provides a projection matrix that minimally changes the representations (in the L2 sense) while guaranteeing the inability to linearly predict the answerability from the modified representations. We fit the method over the last-layer representations of the training instances from Flan-UL2, particularly when these instances are prompted with regular queries from the SQuAD benchmark (refer to §3.3). Then, during inference, the concept-erasing projection matrix is applied in the first generation step, specifically for the test set within the same model-dataset pairing. Our goal is to inspect whether removing the linear separation that exists in the latent space of the model between answerable and (un)answerable questions changes the behavior of the model.
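For intuition, here is a minimal single-direction erasure sketch: projecting representations onto the orthogonal complement of one concept direction. This is not the closed-form LEACE solution of Belrose et al. (2023), which additionally guarantees guardedness against all linear predictors while perturbing the representations minimally in L2; it only illustrates the projection idea.

```python
import numpy as np

def erase_direction(X, w):
    """Project rows of X onto the orthogonal complement of direction w,
    removing whatever is linearly encoded along w. A single-direction
    toy version of concept erasure, not the full LEACE method."""
    w_hat = w / np.linalg.norm(w)
    return X - np.outer(X @ w_hat, w_hat)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))       # stand-in representations
w = np.ones(8)                     # stand-in concept direction
X_erased = erase_direction(X, w)   # X_erased @ w is (numerically) zero
```

After the projection, no linear predictor along `w` can recover the concept, which is the property the erasure experiments exploit.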

Experimental Setup
Our experiments focus on several language models and on three benchmarks.
Benchmarks We consider three QA benchmarks incorporating (un)answerable questions, in a reading comprehension setting where models are tasked with responding to a question within a given context. For each benchmark, we use the entire development set to construct our testing dataset. Additionally, for the probing experiments involving the training of linear classifiers on the models' embeddings, we sample 1,000 instances from each benchmark's trainset, evenly distributed between answerable and (un)answerable instances. Of these, we reserve 800 instances for the training of classifiers, with the remaining instances forming the development set for these classifiers. Below, we describe how we associate answerable questions with paragraphs that contain the answer, and how we associate (un)answerable questions with challenging paragraphs (that do not contain the answer, but may be topically similar to the question).
Our first benchmark is SQuAD 2.0 (Rajpurkar et al., 2018), a reading comprehension dataset, composed of manually-curated question-answer pairs alongside (un)answerable questions, each derived from a single paragraph.
The second benchmark we explore is NATURAL QUESTIONS (NQ; Kwiatkowski et al., 2019), a dataset accumulated from user-generated queries on the Google search engine. Each item within the dataset consists of a question, a retrieved article, a selected paragraph from the article (referred to as the "long answer"), and a short answer inferable from the paragraph. Despite its potential to test QA systems with a retrieval component, our interest lies exclusively in the question-answering setting, hence we utilize the "long answer" as the context, assuming an oracle retrieval system. For the formulation of answerable instances, we select cases with both a long and a short answer, using the former as the context and the latter as the response. For the (un)answerable questions, we pair each query with a paragraph from the sourced passage that has not been annotated as the "long answer". In order to create a challenging dataset, we select the paragraph that is closest in meaning to the question. To achieve this, we encode both the question and all candidate paragraphs using a sentence-transformer (Sentence-BERT; Reimers and Gurevych, 2019) and select the paragraph that exhibits the highest cosine-similarity score.
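The distractor-selection step can be sketched as follows, assuming question and paragraph embeddings have already been computed with a sentence encoder such as Sentence-BERT; the toy 2-D vectors below are purely illustrative.

```python
import numpy as np

def most_similar_paragraph(question_emb, paragraph_embs):
    """Return the index of the paragraph whose embedding has the highest
    cosine similarity to the question embedding."""
    q = question_emb / np.linalg.norm(question_emb)
    P = paragraph_embs / np.linalg.norm(paragraph_embs, axis=1, keepdims=True)
    return int(np.argmax(P @ q))

# Illustrative (hypothetical) 2-D embeddings: paragraph 1 points almost
# in the same direction as the question, so it is selected.
q = np.array([1.0, 0.0])
paragraphs = np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
best = most_similar_paragraph(q, paragraphs)
```

In the NQ construction, the selected paragraph (excluding the annotated "long answer") becomes the challenging context for the (un)answerable version of the query.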
Our final benchmark is the MuSiQue dataset (Trivedi et al., 2022), a multi-hop dataset featuring both answerable and (un)answerable questions. Each instance consists of a question, several candidate paragraphs, an answer, and a decomposition of the question into its single-hop sub-questions. Additionally, each sub-question is paired with a paragraph containing its answer, and all such aligning paragraphs are concatenated and used as context. For the (un)answerable queries, by contrast, some sub-questions lack such an aligned paragraph. For these (un)answerable instances, we identify the paragraph most closely linked to each of the unanswered single-hop questions, using a process akin to the approach with the NQ benchmark. These identified paragraphs are then aggregated, together with the paragraphs corresponding to the other single-hop questions, to form the context for the (un)answerable queries. Table 1 details the full statistics of our test sets across all three benchmarks. These test sets are obtained from the development set of each respective benchmark.

Table 2: F1 scores over the (un)answerability classification task in both the zero-shot and few-shot settings. Each model is prompted with a regular prompt and with a prompt that hints at the possibility of (un)answerability ("+hint"). In the few-shot setting, results are averaged across three variations of in-context-learning examples (with standard deviation in brackets). Bold marks the better prompting method.

Evaluation
The main task over which we evaluate models is the (un)answerability classification task. When evaluating QA models over this task, we only examine whether they tried to answer: we count every example for which the model provides an answer as an instance of answerability prediction, and each example for which the model did not provide an answer as an instance of un-answerability prediction. Note that this evaluation does not consider the correctness of the answers provided. The metric associated with this task is the F1 score, with "unanswerable" considered the positive label. Linear classifiers (§3.3) are also evaluated over the (un)answerability classification task, as the classifier is trained to predict whether or not the question is answerable, based on the hidden representations of the LM. Additionally, in order to make sure that our methods do not hinder the performance of the models over their primary task, we evaluate them over the QA task as well, using the splits provided by the tasks' designers. We report the commonly used metrics: exact match (EM) and (token-wise) F1 scores (Rajpurkar et al., 2016).
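For reference, the token-wise F1 between a predicted and a gold answer can be sketched as below; this is a simplified version of the SQuAD evaluation metric, omitting the official normalization of articles and punctuation.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string
    (simplified SQuAD-style metric: lowercase + whitespace tokenization,
    no article/punctuation normalization)."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Exact match simply checks string equality after the same normalization, and the classification F1 treats "unanswerable" as the positive class.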

Zero-Shot Scenario
Table 2a presents the results in the zero-shot setting (the model was not provided with question-answer examples). It shows that the detection of (un)answerable questions is substantially improved upon the integration of a hint towards the possibility of (un)answerability into the prompts, with gains as high as 80 points. It can also be observed that, without a hint, the ability to discern (un)answerable queries tends to be superior in larger models. Interestingly, the introduction of the hint appears to mitigate the impact of model size, as evidenced by the smaller Flan-T5 xxl surpassing its larger counterparts in two out of three benchmark evaluations.
Additionally, Table 3a displays the models' exact match and token-wise F1 scores over the QA task, where the model is tasked with both detecting (un)answerable questions and providing a correct answer to the answerable ones. It reveals a notable enhancement in the quality of generated responses when prompted with a hint, in some cases resulting in improvements of over 50 points (on both metrics). The improvement over the QA task arises in large part from models giving the correct response to (un)answerable questions. We observe an average drop of 8.3% in F1 and 7.1% in exact match over answerable questions in the zero-shot setting when providing the hint (see Appendix B for all the results over answerable questions).

Table 3: Exact match (EM) and (token) F1 scores over the QA task in the zero-shot and few-shot settings. For each model, there are two prompt variants: regular and with a hint of the possibility of (un)answerability. In the few-shot setting, results are averaged across three variations of in-context-learning examples (with standard deviation in brackets). Bold marks the better prompting method.

Few-Shot Scenario
Table 2b provides an exhaustive overview of the results in the few-shot setting. In order to mitigate the impact of the chosen examples, we experiment with three variants of in-context examples for each benchmark, and report the average results, as well as the standard deviation. Mirroring the trend seen in the zero-shot scenario, when the prompts encapsulate a hint towards the potential of (un)answerability, there is a significant improvement in the identification of (un)answerable queries. This trend is further corroborated by Table 3b, which reports the exact match and token-wise F1 scores of the models over the QA task. See Appendix D for a comparison of the two possible hints: in the instructions, and as an (un)answerable example.

Beam Relaxation
Figure 3 illustrates the models' ability to detect (un)answerable queries when gradually increasing the beam size. Although the increase in beam size yields a negligible impact on the final, most probable response (as depicted by the horizontal lines in Figure 3), it shows better recognition of (un)answerability. This is illustrated by a consistent increase in the presence of (un)answerability-acknowledging responses within one of the beams (signified by the height of the bars). This observation underscores the notion that beneath the facade of overconfidence expressed by these models, the models do encode their inability to respond to certain queries. Importantly, we find that this approach has very little negative impact on the answerable questions, with only a slight degradation in the exact match and F1 scores (see Appendix E for further details). Notably, we conjecture that the observed decrease in performance on the NQ and MuSiQue benchmarks, compared to SQuAD, can be attributed to two main factors: distribution shift and a more challenging task environment. One contributing factor is the non-conventional format of queries in NQ; unlike the typical question format found in datasets like SQuAD, NQ queries do not always adhere to this pattern. Models primarily trained on question-answering datasets like SQuAD might struggle with this distribution shift, leading to a decline in their performance when faced with non-question-formatted queries.
In addition, the MuSiQue dataset introduces a significant challenge by requiring multi-hop reasoning. There are limited datasets available on which models can be trained for such complex tasks, and even fewer with (un)answerable questions. This scarcity, coupled with the demand for multi-hop reasoning, amplifies the difficulty of MuSiQue. This high complexity is further highlighted by the diminished performance of models, even when responding to answerable questions, as evident in Tables 6 and 9 in Appendices B and E, respectively. This drop in performance shows how challenging these benchmarks are, especially when compared to easier ones like SQuAD.

Identifying an Answerability Subspace
We report the performance of the linear classifiers in Table 4. Notably, when considering the standard prompt, the F1 of the probe is above 75% for all models and datasets. Furthermore, we find that hinting at the possibility of (un)answerability only marginally improves the ability to correctly classify queries from the representations within the models. These results suggest the existence of an '(un)answerability' linear subspace.

Table 4: F1 scores of (un)answerability classification of the linear classifier trained for each model-dataset pair, once with the regular prompt and once with the prompt that hints at the possibility of (un)answerability. For each model, we classify once based on the first layer and once based on the last layer of the first generated token.
Visualization. To examine this hypothesis, we perform a PCA projection of the embedding of the final hidden layer of the first generated token onto a 3-D plane. Figures 1, 4, and 5 display the results for the Flan-UL2, Flan-T5 xxl, and OPT-IML models, respectively. Consistent with our hypothesis, (un)answerable queries that were correctly identified as such by the model (depicted by red dots in the figures) are distinctly separate from the answerable queries (represented by blue dots). This separation becomes especially pronounced when the prompt incorporates a hint (as illustrated in the right subfigures). Importantly, we find that (un)answerable questions which the models failed to recognize as such, instead generating a hallucinated response (indicated by pink dots), appear to reside within a separate linear subspace. This finding demonstrates that, notwithstanding the overconfidence exhibited by these models, they intrinsically possess the capacity to distinguish (un)answerable queries. This intrinsic capability is particularly evident given that the subspace corresponding to hallucinated (un)answerable questions (pink) seems to be positioned between that of the answerable queries (blue) and that of correctly identified (un)answerable queries (red). This positioning is suggestive of the models' inherent uncertainty.

Transfer Between Datasets. In Figure 6 we present the transferability of the (un)answerability classifier trained on a given dataset to other datasets. While performance deteriorates, the F1 scores are still well above those we calculated over the uncontextualized first layer. This suggests that, to a large degree, the probes identify an abstract (un)answerability subspace beyond dataset-specific shallow features.

Table 5: Exact match (EM) and F1 scores of all questions and of answerable questions in the zero-shot setting for the Flan-UL2 model on the SQuAD benchmark with a beam size of 3. The results demonstrate the performance before and after the application of the concept erasure, for the regular k-beam decoding approach and the relaxed variant.
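The projection step itself is straightforward; a sketch with stand-in embeddings follows (real vectors would be the first-token last-layer states, colored by answerability when plotted).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in embeddings for answerable (first 100 rows) and unanswerable
# (last 100 rows) questions; the mean shift is a synthetic separation.
X = np.concatenate([
    rng.normal(0.0, 1.0, (100, 32)),
    rng.normal(2.0, 1.0, (100, 32)),
])

# Reduce to three components, yielding the 3-D coordinates that a
# scatter plot such as Figure 1 would visualize.
coords = PCA(n_components=3).fit_transform(X)
```

Linear separability in the first few principal components is what the figures make visible, while the logistic-regression probes quantify it.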

Erasing the Answerability Subspace
Recall that if the subspaces we found are causally related to the predictions of the model, we expect that erasing them would deteriorate the model's performance in the answerability task. Indeed, when linearly erasing the answerability subspace from the first-token representation of Flan-UL2, we see the F1 score over the (un)answerability classification task decreasing from 50.1 to 31.2 with the regular beam, and from 65.4 to 32.7 with beam relaxation. This trend is also evident from the results on the QA task presented in Table 5, as well as when projecting the embeddings on a 3-D plane using PCA, as depicted in Figure 7. This suggests that the answerability subspace influences the model's behavior in the context of the answerability task.

Conclusion
We found ample evidence for LMs' ability to encode the (un)answerability of questions, despite the fact that models tend to be over-confident and generate hallucinatory answers when presented with (un)answerable questions. We also showed that this discrepancy between a model's output and its hidden states is mitigated by simply adding the option of (un)answerability to the prompt. The evidence we found includes the existence of a reply acknowledging the (un)answerability in a beam of decoded answers, meaning that even though the models' best-assessed answer is hallucinatory, the true answer is not lagging too far behind. We also showed that the models' representations after encoding the question and before decoding the answer are highly influenced by the answerability of the question or lack thereof, with answerable and (un)answerable questions being linearly separable in the embedding space. We conclude that the problem of answered (un)answerable questions can be mended with either better prompting, better decoding, or simple auxiliary classification models.

Limitations
We focus on a few datasets and models. Despite the effort to experiment with several models, future work should experiment with different models, and in particular, examine the relation between the ability to encode (un)answerability and model scale.
We also do not compare the different approaches explored in this paper, which we leave as an interesting future research direction. Our focus on linear probing and linear erasure stems from the availability of existing methods from this family, but deep LMs are highly nonlinear and may encode the information we are interested in nonlinearly. As such, our results should only be interpreted as a lower bound for the identification of (un)answerability. Lastly, our experiments focused on (un)answerability in a given context. Future work should also explore the phenomenon in the open-domain setting.

Ethics Statement
Model hallucination, in general, can have real-world implications when models are incorporated in, e.g., search engines or other applications. Our study focuses on the ability to discern a specific type of hallucination in a selected set of models and datasets. It should not be taken as a general solution to the problem of hallucination in the QA setting, but rather as preliminary research on potential techniques for mitigating the problem of hallucination.

A (Un)answerability-Recognizing Responses
After analyzing the responses generated by the different models in this work, we curated a list of answers that signify abstention from answering, which we used to identify responses that signify (un)answerability. This includes: "Unanswerable", "N/A", "I don't know", "IDK", "Not known", "Answer not in context", "Unknown", "No answer", "It is unknown", "None of the above", "None of the above choices", "The answer is unknown", along with their corresponding versions in lowercase.

B Performance on the Answerable Instances
Table 6 shows the exact match and F1 scores of each model over the QA task, restricted to the answerable questions of each benchmark, in both the zero-shot and few-shot settings. Note that although the addition of the hint hinders the models' performance, the drop over the answerable questions is small for the most part and is outweighed by the improved detection of (un)answerable questions, leading to the overall improvement shown in Table 3.

C Prompt Variant Tuning
In our work, we experiment with three variants of the prompt containing a hint of the possibility of (un)answerability:

1. Given the following passage and question, answer the question. If it cannot be answered based on the passage, reply "unanswerable".

2. Given the following passage and question, answer the question. If you don't know the answer, reply "IDK".

3. Given the following passage and question, answer the question. If there is no correct answer, reply "N/A".

We run all three variants on a separate development set to decide which prompt to use for each model and dataset. Table 7 shows the results of all three variants on our development set. Based on these results, we use the first variant in our experiments on the SQuAD and NQ datasets (on all models), and the third variant in our experiments on the MuSiQue dataset (on all models).

D Impact of Hint Placement
To gain a deeper understanding of the circumstances under which the addition of a hint in the few-shot scenario is most beneficial, we conduct two ablations on the prompts. In one, we include the hint only in the instructions, with both exemplars answerable; in the other, we use the regular instructions, but one of the two exemplars is (un)answerable. As per the data presented in Table 8, it is evident that the inclusion of a hint within the instructions is considerably more advantageous compared to its addition within the exemplars. Indeed, once a hint is incorporated within the instructions, further inclusion within the exemplars has a minimal impact on the results, with OPT-IML being the sole exception.

E Impact of the Relaxed Beam Search Decoding on the Answerable Queries
Table 9 presents the scores of each model on answerable questions, evaluated under the framework of our beam inspection experiments (see §3.2). Two decoding approaches were employed: a conventional beam-search decoding and its relaxed variant, where the top answer is supplanted by an (un)answerability-recognizing response if one emerges within the beam. Our findings suggest a marginal impact of the adapted beam search on answerable queries, with a maximal reduction of 7.7 and 8.7 points observed in the exact match and F1 scores respectively, when compared to its regular counterpart. As in Appendix B, these results point to the fact that any improvement achieved in §5.2 indeed stems from better treatment of (un)answerable questions.

F In-Context-Learning Variants
In order to mitigate the effect of the chosen in-context examples in the few-shot setting, we experiment with three variants of in-context examples and average their scores. Figure 8, Figure 9, and Figure 10 show the different in-context example variants for the SQuAD, NQ, and MuSiQue tasks, respectively.
Figure 1: 3-D PCA projection of the last hidden layer's embedding of Flan-UL2 on each of the three benchmarks. The left images show the embeddings with the regular prompt, and the right ones with a hint-including prompt. Blue and red dots are examples correctly detected by the model as answerable and (un)answerable, respectively, while the pink dots are for (un)answerable examples that the model provided answers to. The figures show the good separability between the three groups.

Figure 2:
Figure 2: Combinations of prompt variants in this work. In addition to some basic instructions, our prompts can also have a "hint" to the possibility of (un)answerability, as well as 2 exemplars.

Figure 3:
Figure 3: F1 over the (un)answerability classification task with beam relaxation. In this setting, the models were considered successful iff a reply acknowledging the (un)answerability of a question was found anywhere in the beam. The horizontal lines show the F1 for the usual metric, i.e., successful classification only if the correct reply was at the top of the beam.

Figure 4:
Figure 4: 3-D PCA projection of the last hidden layer's embedding of the Flan-T5 xxl model on each of the three benchmarks. The left images show the embeddings with the regular prompt, and the right ones with the hint-including prompt.

Figure 5:
Figure 5: 3-D PCA projection of the last hidden layer's embedding of the OPT-IML model on each of the three benchmarks. The left images show the embeddings with the regular prompt, and the right ones with the hint-including prompt.

Figure 6:
Figure 6: F1 scores of (un)answerability classification, as determined by a linear classifier trained for each model-dataset pair and tested on the other benchmarks. Within each heatmap, the column designates the dataset used for training, while the row illustrates the dataset on which the classifier was tested.

Figure 7:
Figure 7: 3-D PCA projection of the last hidden layer's embedding of the Flan-UL2 model on the SQuAD dataset, without performing erasure (left) and after erasure (right).

Table 1:
Number of extracted answerable and (un)answerable questions per dataset in our test set.

Table 6:
Exact match (EM) and F1 scores over the QA task only for answerable questions, in the zero-shot and few-shot settings. For each model, there are two prompt variants: regular and with a hint of the possibility of (un)answerability. In the few-shot setting, results are averaged across three variations of in-context-learning examples (with standard deviation in brackets). Bold marks the better prompting method.

Table 7:
Exact match and (token) F1 scores in the zero-shot setting for three variants of the prompt containing a hint of the possibility of (un)answerability, on our development set.

Table 8:
F1 scores over the (un)answerability classification task in the few-shot setting. Each model is prompted with a regular prompt, and with three types of hint-including prompts: only in the instructions ("+hint (I)"), only in the exemplars ("+hint (E)"), and in both ("+hint (E&I)"). Results are averaged across three variations of in-context-learning examples (with standard deviation in brackets). Bold marks the better prompting method.