Is the Answer in the Text? Challenging ChatGPT with Evidence Retrieval from Instructive Text

Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may suffer from hallucination, wrongly cite evidence, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark of a set of WikiHow articles exhaustively annotated with evidence sentences for questions, which comes with a special challenge: all questions are about the article's topic, but not all can be answered using the provided context. Interestingly, we find that when using a regular question-answering prompt, ChatGPT neglects to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.


Introduction
Generative language models (LMs) are trained to generate an output text given an input text. While such models have recently shown remarkable performance in various NLP tasks (Touvron et al., 2023a; Radford et al., 2019; Brown et al., 2020), they are known to suffer from hallucination, i.e., they often generate text that lacks evidence (McKenna et al., 2023; Ji et al., 2022). This may lead to the spread of misinformation (Dong et al., 2022; Carlini et al., 2021), and thus reduce the systems' trustworthiness.
Figure 1: Evidence retrieval for questions related to instructive articles from WikiHow, illustrated with the article "How to Raise Alkaline Phosphatase Levels" and two example questions ("Are there any risks or precautions to keep in mind when trying to raise alkaline phosphatase levels in the body?" and "What is alkaline phosphatase and why is it important to maintain healthy levels?"). For the question in the upper box, a system should ideally identify the sentences annotated as evidence in the text. For the question in the lower box, it should not retrieve wrong sentences as "evidence."

In this paper, we focus on a use case where a generative LM is queried for advice on a range of personal issues, including health, interpersonal relationships, or difficult tasks. This is a challenging scenario for LMs because questions are often open-ended and non-factoid, and require well-informed instructions as answers. As illustrated in Figure 1, we explicitly query generative LMs to retrieve evidence sentences for answering a question from a trustworthy instructive text. Our challenging setup requires two competencies on the model side: (i) identifying whether or not the question is answerable using only the provided text as input, and (ii) retrieving evidence from the trustworthy source, which could, e.g., support a generated answer.
Existing question answering datasets, e.g., WikiHowQA (Deng et al., 2020) and SQuAD (Rajpurkar et al., 2016, 2018), do not fit our evaluation setup. The WikiHowQA dataset (Deng et al., 2020) uses the titles as questions and does not cover the sentence retrieval aspect. SQuAD contains unanswerable questions but focuses on factoid questions. To make our evaluation setup challenging and sound, we create a new high-quality test set. We collect a set of diverse and open-ended questions for WikiHow articles via crowd-sourcing, and perform double annotation of evidence sentences in the articles. We use our dataset to systematically perform a challenging evaluation of ChatGPT, a successor of InstructGPT (Ouyang et al., 2022b), which has been pretrained on a huge amount of text including instructive web texts.
Our contributions are as follows: (1) We create and publish WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation test set of lexically diverse and open-ended questions for instructive articles taken from WikiHow. Evidence sentences are annotated with high agreement in all documents. WHERE contains both questions with evidence in the article and questions without.
(2) We evaluate ChatGPT on this dataset in zero- and few-shot settings. Our experiments show that despite decent results on retrieving evidence for questions with evidence in the text, ChatGPT fails to recognize questions for which the text does not provide any evidence. When provided with a few no-evidence examples in the prompt, it refuses to answer if there is no evidence, but at the expense of recall in sentence retrieval. (3) We make our dataset and code publicly available.

WikiHow Evidence Retrieval Dataset
Our goal is to collect questions with and without answer evidence in the text that are lexically, syntactically, and semantically diverse. In a pre-study, we find it hard to achieve this goal when the content of the article is known to the writer. We therefore resort to crowdsourcing for question writing, and identify good cases by double-annotating the sentences deemed as evidence for the answer. Our reasoning is that if agreement on this task is low, the question is either somewhat ill-posed or too close to the overall topic of the article.

Dataset Creation
We export WikiHow articles for the following categories: Arts and Entertainment, Home and Garden, Health, Relationships, and Travel. We collect articles for all categories in October 2022 and additional articles for Home and Garden and Arts and Entertainment in December 2022.
Question collection. To collect questions, we set up a Human Intelligence Task (HIT) on Amazon Mechanical Turk. We display the title of the article (e.g., "How to Dress up as a Disney Character"), the first paragraph of the article, and a set of keywords generated from the full article, and ask the crowd-workers to write six questions for which they would expect to find answers in the article (see Appendix D). Since annotators never see the full article, they can only make educated guesses about which questions it may answer, and thus sometimes write questions that cannot be answered even given the complete article. We encourage workers to start their questions with different question words (what, why, how, can, should, who). Our crowd-workers must be Master Workers, live in the UK or US, and have a HIT approval rate of at least 95%.
Answer evidence annotation. We tokenize documents into sentences using NLTK (Bird et al., 2009) and rule-based corrections, e.g., for enumerations. We then use the web-based annotation platform INCEpTION (Klie et al., 2018) to mark all sentences in an article that provide evidence for answering a question. We double-annotate 570 questions from 95 documents in two teams: one team is composed of several authors of this paper, the other of paid annotators with engineering backgrounds and prior experience in NLP annotation tasks, who participate in a training phase. We allow annotators to discuss difficult cases within each team.
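As a minimal sketch of this preprocessing step (the exact correction rules are not reproduced here; the enumeration heuristic below is an illustrative assumption):

```python
import re

import nltk


def split_sentences(document: str) -> list[str]:
    """Split a WikiHow article into sentences with NLTK, then apply a
    simple rule-based correction for enumerations (illustrative only)."""
    nltk.download("punkt", quiet=True)  # tokenizer model for sent_tokenize
    sentences = nltk.sent_tokenize(document)
    fixed = []
    for sent in sentences:
        # Assumed correction: '#' marks a new list item in WikiHow
        # exports, so force a sentence boundary before it.
        parts = re.split(r"(?=#)", sent)
        fixed.extend(p.strip() for p in parts if p.strip())
    return fixed
```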
Agreement. We treat the annotations of one team as the gold standard and the other team's as the system output. The teams agree with precision/recall/F1 of 70.6/57.3/63.3 on whether a document provides no evidence for a question at all. This corresponds to a κ-score (Cohen, 1960) of 0.43, which can be interpreted as moderate agreement according to Landis and Koch (1977).
To ensure a high-quality evaluation set, we compute question-level precision, recall, and F1 for the binary task of deciding whether a sentence provides evidence for answering a question. We keep only the questions with an F1 score of at least 0.3, and the questions that both teams consider to have no evidence. Annotators often disagree if the question is somewhat unclear or if the text contains evidence for only some of a question's aspects. By design, our filtering using a positive threshold for F1 removes any questions that only one team considered to have evidence in the article. Sentence-level κ is 0.60, which is generally considered a solid score in semantic annotation tasks. For creating our gold standard, we take the union of the sentences marked by both teams as relevant evidence, as disagreements on the filtered cases are mostly due to different decisions on how much context to include.
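The following sketch shows how the question-level filtering and the sentence-level κ can be computed; representing each team's annotations as a set of sentence IDs is an illustrative assumption, not our exact code:

```python
from sklearn.metrics import cohen_kappa_score


def question_f1(gold: set[int], pred: set[int]) -> float:
    """Question-level F1, treating one team's sentence set as gold and
    the other's as prediction."""
    if not gold and not pred:
        return 1.0  # both teams see no evidence: perfect agreement
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def keep_question(team_a: set[int], team_b: set[int]) -> bool:
    """Filtering rule: keep questions with F1 >= 0.3, plus questions
    both teams marked as having no evidence."""
    if not team_a and not team_b:
        return True
    return question_f1(team_a, team_b) >= 0.3


def sentence_kappa(team_a: set[int], team_b: set[int], n_sents: int) -> float:
    """Sentence-level Cohen's kappa for one question."""
    a = [int(i in team_a) for i in range(n_sents)]
    b = [int(i in team_b) for i in range(n_sents)]
    return cohen_kappa_score(a, b)
```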

Dataset Analysis
Table 1 shows the corpus statistics for our final test set. About half of the questions can be answered based on the evidence in the corresponding document (with-evidence questions); the other half cannot (no-evidence questions). Examples are shown in Figure 1. The instructive texts in WikiHow are kept simple, as indicated by the short average sentence length. However, documents are long, which adds another challenge to our setup. For with-evidence questions, on average, about 14% of the sentences are part of the evidence. Table 2 illustrates the distribution of question types: questions are indeed varied. Of the yes/no questions, 81% ask for specific facts (is/are there, will/do/does) and 19% ask for suggestions or recommendations (should/can/could).

Method
We evaluate ChatGPT (version of March 2023, built upon GPT-3.5) in two settings. In the zero-shot setting, we prompt ChatGPT using a template that asks the model to output a list of sentence IDs that can provide evidence for the answer. The list is empty if the model does not retrieve any evidence (see Appendix A). In the few-shot setting, we configure the message history such that ChatGPT sees five training examples from the test instance's category, e.g., Health, using the same prompt template as in the zero-shot setting but including gold responses. The training instances consist of additional question-article pairs annotated by the paid annotator team only. Every five-shot training set contains exactly two no-evidence questions and three with-evidence questions, which are about different documents. To accommodate ChatGPT's context window size of 4096 tokens, we only use chunks of the training articles in the few-shot setting; Appendix B provides details.
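As a minimal sketch of this setup, using the OpenAI chat API as of early 2023, the message history can be assembled as follows; the prompt wording and the gpt-3.5-turbo identifier are illustrative stand-ins for the actual template in Appendix A:

```python
import openai

# Stand-in for the actual prompt template (Appendix A, not shown here).
PROMPT = ("Given the numbered sentences of the article below, output a Python "
          "list with the IDs of all sentences that provide evidence for "
          "answering the question. Output [] if there is no evidence.\n\n"
          "Article:\n{article}\n\nQuestion: {question}")


def retrieve_evidence(article: str, question: str, shots=()) -> str:
    """Query ChatGPT with an optional few-shot message history.

    `shots` is a sequence of (article, question, gold_ids) triples from
    the training pool; it is empty in the zero-shot setting.
    """
    messages = []
    for ex_article, ex_question, gold_ids in shots:
        messages.append({"role": "user",
                         "content": PROMPT.format(article=ex_article,
                                                  question=ex_question)})
        # Gold response: the annotated evidence IDs as a Python list.
        messages.append({"role": "assistant", "content": str(sorted(gold_ids))})
    messages.append({"role": "user",
                     "content": PROMPT.format(article=article, question=question)})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                            messages=messages)
    return response["choices"][0]["message"]["content"]
```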
Since ChatGPT may generate arbitrary text even though we ask it to output a list, we need to post-process the model outputs before evaluation. We first attempt to parse the model output using Python's eval function. If this fails, we try to match lists within the string (as ChatGPT sometimes provides explanations or copies the selected sentences in addition to the list) using various regular expressions. If none of this works, which happens only three times in the entire experiment, we manually parse the model output. A small number of questions are rejected by OpenAI's content filter. For those, we assume the empty list as model output, as ChatGPT does not retrieve any evidence from the article.
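A sketch of this post-processing pipeline (the regular expressions shown are illustrative, not our exact ones):

```python
import re


def parse_model_output(output: str) -> list[int] | None:
    """Parse ChatGPT's raw output into a list of sentence IDs, following
    the pipeline described above; returns None if manual parsing is needed."""
    # 1) Try to interpret the whole output as a Python list literal.
    try:
        parsed = eval(output, {"__builtins__": {}}, {})
        if isinstance(parsed, list):
            return [int(x) for x in parsed]
    except Exception:
        pass
    # 2) Fall back to matching a bracketed list inside the string, since
    #    ChatGPT sometimes wraps the list in explanations.
    match = re.search(r"\[([^\]]*)\]", output)
    if match:
        return [int(i) for i in re.findall(r"\d+", match.group(1))]
    # 3) Give up: this output must be parsed manually.
    return None
```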

Experiments and Analysis
Evaluation metrics. For no-evidence questions, we report recall for correctly recognizing that an article does not contain any evidence for answering a question. For with-evidence questions, we compute precision, recall, and F1 score of the relevant class ("evidence sentence") per question, and report the macro-average over questions.
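A sketch of the metric computation, assuming gold and predicted evidence are given as sets of sentence IDs per question (an empty gold set marks a no-evidence question; this representation is an assumption for illustration):

```python
def evaluate(gold: dict[str, set[int]], pred: dict[str, set[int]]) -> dict:
    """Compute per-question P/R/F1 for with-evidence questions (macro-
    averaged) and recall for recognizing no-evidence questions."""
    p_list, r_list, f_list = [], [], []
    no_ev_correct, no_ev_total = 0, 0
    for qid, gold_ids in gold.items():
        pred_ids = pred.get(qid, set())
        if not gold_ids:
            # No-evidence question: counted as recognized correctly
            # when the model retrieves nothing.
            no_ev_total += 1
            no_ev_correct += int(not pred_ids)
            continue
        tp = len(pred_ids & gold_ids)
        p = tp / len(pred_ids) if pred_ids else 0.0
        r = tp / len(gold_ids)
        f = 2 * p * r / (p + r) if p + r else 0.0
        p_list.append(p); r_list.append(r); f_list.append(f)

    def macro(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "with_evidence_P": macro(p_list),
        "with_evidence_R": macro(r_list),
        "with_evidence_F1": macro(f_list),
        "no_evidence_recall": no_ev_correct / no_ev_total if no_ev_total else 0.0,
    }
```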
Baselines. The random baseline for with-evidence questions predicts that a sentence contains relevant evidence in 14% of the cases, which corresponds to the average percentage of sentences marked as evidence per question in the test set. Similarly, it predicts "no evidence" for a question with a probability of 49.2%. The results reported as "human scores" correspond to the agreement scores.
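For illustration, this baseline can be sketched as follows (the two probabilities are the test-set statistics reported above):

```python
import random


def random_baseline(n_sentences: int) -> set[int]:
    """Random baseline: predict 'no evidence' with probability 0.492,
    otherwise mark each sentence as evidence with probability 0.14."""
    if random.random() < 0.492:
        return set()
    return {i for i in range(n_sentences) if random.random() < 0.14}
```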
Results. Table 3 shows the zero- and few-shot performance of ChatGPT, evaluated separately on with-evidence and no-evidence questions. In the zero-shot setting, ChatGPT outperforms the baseline for the with-evidence questions by a large margin, but fails to recognize when there is no evidence for a question in an article. When presented with five training instances, of which two are no-evidence questions, ChatGPT almost reaches human performance at recognizing no-evidence questions. This, however, comes at the cost of decreased performance on the questions with evidence. On with-evidence questions, ChatGPT is still far from human performance in both the zero- and few-shot settings.
Analysis. In the zero-shot setting, OpenAI's content filter identifies the content of 10 of the test cases as harmful and declines to generate outputs for them. We do not consider these instances to be harmful; e.g., they are about instructions on how to find emergency procedures in hotels. For the few-shot configurations, these cases vary from 11 to 16 instances, depending on the few-shot prompts. This illustrates that improving content filters is an important direction for future research.
Table 4 breaks down the evaluation results by the question types identified in Table 2. Comparing how, what, and yes/no questions, we find that the latter are the easiest in both the zero- and the few-shot setting. In terms of recognizing that there is no evidence for a question, how questions are consistently more difficult than what and yes/no questions.
Applicability beyond ChatGPT. While this paper focuses on ChatGPT, WHERE can be used to evaluate any other large LM whose training data does not contain WHERE. We also evaluated Llama 2-Chat (Touvron et al., 2023b) in the zero-shot setting, finding it to perform considerably worse than ChatGPT in the same setting (see Table 5 in Appendix C). In addition, we needed to manually parse 30 model outputs (compared to one when using ChatGPT).
Impact of pre-training data. It is quite possible that some of the WikiHow articles (or earlier versions thereof) have been part of ChatGPT's training data. For our evaluation setup, however, this is not an issue, since the documents are contained in the model input anyway. We intentionally did not crawl questions from the web or, e.g., the WikiHow comments sections, but let crowd-workers write them based on our methodology, mitigating bias towards the articles' content. Hence, it is unlikely that existing LLMs have seen these specific questions or their answers as part of their pre-training.

Related Work
Extractive QA datasets. The SQuAD datasets (Rajpurkar et al., 2016, 2018) are the largest datasets for extractive QA. Their questions are factoid and can be answered through short, single answer spans. In contrast, our dataset includes non-factoid questions based on instructional text and requires models to identify a set of answer spans across long documents. The MultiSpanQA (Li et al., 2022) and MASH-QA (Zhu et al., 2020) datasets also contain QA pairs with several answer spans. However, MultiSpanQA is automatically derived from Natural Questions (Kwiatkowski et al., 2019), and MASH-QA also contains automatically created QA pairs. Only a few datasets contain unanswerable questions, e.g., the latest version of SQuAD. However, its questions have been created given a text passage, whereas the questions of WHERE are based only on the keywords and a summary of an article. This leads to larger lexical variety between the question and the text and thus creates a more realistic and challenging setting.

Table 4: Results of ChatGPT on our evaluation set, separated by question type. For 5-shot, we report the average and standard deviation of 5 different randomly sampled per-category 5-shot results. -: no questions of this type in the dataset. #w: number of with-evidence questions; #n: number of no-evidence questions.
Evidence retrieval. Our work builds on a long line of NLP work addressing the fine-grained retrieval of supporting facts from text. This is a common basis for tasks like question answering (Murdock et al., 2012), fact checking, or slot filling (Petroni et al., 2021), and has also been used to support other tasks, e.g., question generation (Lewis et al., 2020), document retrieval (Akkalyoncu Yilmaz et al., 2019), or natural language inference (Vladika and Matthes, 2023).
Analysis of ChatGPT. Since the publication of ChatGPT, many papers have focused on analyzing the applicability of the system (Liu et al., 2023b) to different tasks and domains, among others question answering for education (Frieder et al., 2023) and in medical environments (Nov et al., 2023). This large attention indicates the importance of the system for real-world applications and is the main motivation for our work. Despite promising performance on several benchmark datasets, for instance logical reasoning comprehension (Liu et al., 2023a) or complex question answering (Tan et al., 2023), previous studies also reveal shortcomings of GPT models, for instance their hallucination problem (Ouyang et al., 2022a; Li et al., 2023; Zhang et al., 2023). To address this issue, we propose to analyze the usage of ChatGPT in a setting in which we force it to retrieve evidence from text instead of directly generating an answer.

Conclusion and Outlook
In this work, we have introduced a challenging benchmark to evaluate ChatGPT on the task of retrieving evidence sentences from instructive text to answer a question. We have presented a new high-quality dataset from WikiHow consisting of crowd-sourced questions and expert evidence annotations. A special real-world challenge of our dataset is the inclusion of questions that are not answerable given an article. When evaluating ChatGPT on our dataset, we found that it fails to detect no-evidence questions if not provided with targeted examples in the prompt. Our results on the new benchmark highlight important shortcomings of generative large language models that need to be addressed by future research: besides hallucinating answers, there is no free lunch with regard to calibration in evidence-retrieval setups.

Limitations
Since we accessed ChatGPT via the OpenAI API, our results might not be reproducible if the model behind the API is replaced with a new version. Moreover, previous work has shown that ChatGPT is stable only to an extent of 79% for complex question answering tasks (Tan et al., 2023).
Another limitation of using ChatGPT via the API is that we do not have access to the model parameters and probability distributions.This reduces the amount of analysis we can perform on the model results.
The dataset we present in this work is double-annotated and, as a result, of high quality, but rather small compared to other question-answering datasets (which, however, are often crowd-sourced, created using heuristics, and/or contain less varied questions).

Ethics Statement
Data source and licensing. We ensured that we may re-publish the WikiHow texts under CC BY-NC-SA-3.0 by obtaining written permission from WikiHow. Some of the selected articles cover topics related to health and relationships, yet no personal information is involved.
Crowd-sourcing and annotation. We collect our set of questions via Amazon Mechanical Turk. We pay 1 dollar per HIT, increased to 2 dollars per HIT in the subsequent tasks, given the average completion time (between 5 and 10 minutes).
The paid annotators participating in our project were fully aware of the goal of the annotations and even helped design the annotation scheme. They gave explicit consent to the publication of their annotations. The main annotator was paid considerably above our country's minimum wage. For this type of semantic annotation task not involving any personal data, our institution did not require an IRB review.

C Additional Results
Here, we provide an extended version of Table 3 that includes scores for Llama 2-Chat (13B) (Touvron et al., 2023b) in the zero-shot setting. Compared to ChatGPT, Llama 2 is equally bad at detecting no-evidence questions and worse at recognizing relevant sentences for answering with-evidence questions.

D Crowdsourcing
We show the interface that we used for collecting the questions on Amazon Mechanical Turk in Figure 3. The interface contains two parts. On the left, we show crowd-workers two components based on the WikiHow article: the summary and the keywords. The summary is the first paragraph of the WikiHow article, which typically functions as an introduction to the article's topic. The keywords are a set of keywords that we generated from the WikiHow article using the Wordcloud package. On the right, we provide a form where crowd-workers can submit six questions. Crowd-workers are encouraged to make use of our suggestions on how to start a question and to start each question with a different word. Furthermore, in the first three input fields, we ask crowd-workers to ask a question about a certain topic. In Figure 3, these topics are, from top to bottom: Researching Upgrade Options, Planning Your Honeymoon, and Upgrading on the Go. These are the section headers from the WikiHow article, which we use to collect questions that are relevant to the article.
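A minimal sketch of the keyword generation with the Wordcloud package (the parameter settings here are assumptions, not our exact configuration):

```python
from wordcloud import WordCloud


def article_keywords(article_text: str, top_k: int = 15) -> list[str]:
    """Extract display keywords from a WikiHow article; wc.words_ maps
    words to relative frequencies, ordered from most to least frequent."""
    wc = WordCloud(max_words=top_k).generate(article_text)
    return list(wc.words_.keys())
```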

Figure 3: The interface used in our crowdsourcing setup for collecting questions. The left part of the interface shows the summary and keywords based on the WikiHow article, and the right shows the form that crowd-workers used to submit their questions.

Table 2: Distribution of question types in the test set.

Message history passed as input to ChatGPT in the zero- and few-shot settings. x_t: test instance, x_i: training instance i, y_i: ground truth for x_i.

Table 5: Results on our evaluation set. For 5-shot, we report the average and standard deviation of 5 different randomly sampled per-category 5-shot results. *Harmonic mean of the two recall scores, estimated on all 570 double-annotated questions. Llama 2: Llama 2-Chat (13B).