INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations



Introduction
Despite their recent success, LLMs have long been known to exhibit several patterns of concern (Weidinger et al., 2021) such as generating statements which may be toxic (Ousidhoum et al., 2021), biased (Ferrara, 2023), unfair (Ramesh et al., 2023) and factually incorrect (Azamfirei et al., 2023). The last pattern of generating factually incorrect yet seemingly confident statements is commonly labeled as hallucinations in the literature (Ji et al., 2023). It is an important area of study since the confident tone of these generated statements can lead end users to accept them as accurate without any subsequent validation.
Model hallucinations occur in a variety of textual generative applications such as NLG, MT, QA, dialog systems, data-to-text systems, etc. They are believed to be caused by discrepancies in the data used to train the models, or in the model training itself (Ji et al., 2023). Hallucinations are also believed to be caused by the supervised fine-tuning process, in which the model may learn to make factually ungrounded connections within its parametric memory in order to accurately answer the question it is being trained on; such connections can trigger new ungrounded responses as hallucinations during inference.
Typical approaches to evaluate newly developed models for hallucinations test them on Question Answering datasets such as TruthfulQA (Lin et al., 2021), which provides a curated set of challenging questions with valid answers, against which the model-generated responses are compared. However, this approach of using a fixed test set with LLMs is inherently limited: the typical development cycle of a new LLM release involves pretraining on large text corpora regularly scraped from the Internet, and any new challenge dataset may eventually get scraped into this pretraining corpus. Given that LLMs have been shown to memorize training data (Carlini et al., 2022), this form of data leakage can lead to a false sense of improvement on the challenge test set in subsequent model releases. To address this risk, we instead propose to test LLMs using an evaluation framework which uses carefully crafted rules to create new challenge questions in each round. We call our framework INVITE: a testbed of automatically generated INValId questions to evaluaTE large language models for hallucinations. INVITE leverages valid facts from knowledge bases to create new invalid questions which may not have an answer. Our framework can be used to evaluate new LLM release candidates on their robustness against invalid questions which can trigger specific forms of hallucinations, as well as when developing new algorithms to mitigate hallucinations in existing models. The key contributions of our work are as follows:
• We create a new framework to generate invalid questions to evaluate the robustness of LLMs against hallucinations.
• We test our framework on several of the latest LLMs, exploring different model sizes and training datasets.
• We conduct a pilot human evaluation study on the generated responses for these questions, and highlight the effectiveness of the test sets in triggering hallucinations in the models being evaluated.

Related Work
Question Answering datasets A number of QA datasets are available in the literature to test LLMs for hallucinations, including TruthfulQA (Lin et al., 2021), SQuAD (Rajpurkar et al., 2016), and TriviaQA (Joshi et al., 2017), among others. While challenging and effective, all of these test sets suffer from the previously described risk of eventually getting scraped and consumed in model training.
Adversarial QA datasets The QA datasets listed above test models in their ability to retain facts, and new models are tested by comparing their responses to these questions against an expected answer. However, we argue that this strategy alone does not test a model against all possible failure modes related to hallucinations, since the issue stems from the model's ability to concoct new facts as statements. Even if a model were to learn the correct answer to a particular QA question, it can still retain the general tendency to hallucinate. Hence, a more robust strategy is to also test these models using adversarial questions with invalid assumptions. A model capable of avoiding hallucinations would identify that it does not have an answer to the question, or detect that the question itself is not plausible, and hence choose to disengage. Notable adversarial datasets in NLP literature include (Jia and Liang, 2017), where the authors append adversarially crafted distractor sentences to reading-comprehension passages. While rich in diversity and volume, these datasets still suffer from the same risk of getting consumed in model training noted above. Further, they are limited to reading comprehension tasks and hence do not necessarily test the full boundary of a model's knowledge.
Automated Testset Generation Automatically created unit tests have been explored in deterministic applications such as software testing (Chen et al., 2022; Schäfer et al., 2023). With machine learning models such as LLMs, we can leverage a variety of generative models to create new datasets (Duan et al., 2017; Nikolenko, 2019), but to the best of our knowledge, no prior works have tried to generate questions with invalid assumptions. Our proposed approach addresses this gap by setting up a framework to automatically create a new test set of challenge questions with verifiably invalid assumptions, which are likely to trigger hallucinations in the target model.

The INVITE Framework
To create new test questions, we collect valid facts from a knowledge base and distort these to create new unanswerable questions.We describe our process in more detail below, and create a test set using this framework.

Creating Invalid Questions
We use the DBpedia knowledge base (Lehmann et al., 2015) as a source of valid facts to create our questions. The choice of knowledge base here is arbitrary and can be replaced by an alternate application-specific knowledge base as necessary. DBpedia extracts structured factual information from Wikipedia, the world's largest encyclopedia. It contains a large volume of facts which are stored in the Resource Description Framework (RDF) format of subject-predicate-object triples. The most recent release of DBpedia contains over 850 million such factual triples (Holze), making it a decidedly rich source of information to create new test questions for our task. Of these, we use a subset of 42 million triples containing facts about objects and literals extracted from the Wikipedia infoboxes, which are reported to be of higher quality because of their standardized format. For operational simplicity, we limit our scope here to the 100 most frequent predicate types by volume from this subset. We further discard noisy predicate types which contain ambiguous entries after manual inspection (for example, we discard the nationality predicate type since it contains answers of the form country-name as well as citizen/people of country-name, making facts of this type difficult to fit in a consistent question template). The exact list of predicates selected in our dataset creation is listed in Table 1.
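The predicate selection step above can be sketched as follows. This is a minimal illustration, assuming facts are available as (subject, predicate, object) tuples; the function name, toy triples, and blocklist are our own and not taken from the released code.

```python
from collections import Counter

def select_predicates(triples, top_k=100, blocklist=frozenset({"nationality"})):
    """Keep the top_k most frequent predicate types, minus noisy ones."""
    counts = Counter(pred for _, pred, _ in triples)
    frequent = [pred for pred, _ in counts.most_common(top_k)]
    return [pred for pred in frequent if pred not in blocklist]

# Toy example with a handful of triples.
triples = [
    ("Barack_Obama", "birthPlace", "Honolulu"),
    ("Marie_Curie", "birthPlace", "Warsaw"),
    ("Alan_Turing", "almaMater", "Princeton_University"),
    ("Grace_Hopper", "almaMater", "Yale_University"),
    ("Marie_Curie", "nationality", "Polish"),
]
print(select_predicates(triples, top_k=3))  # → ['birthPlace', 'almaMater']
```

In the actual pipeline the counts would be taken over the 42 million infobox triples, and the blocklist would be the manually inspected noisy predicates.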
To create new questions, we first curated over 300 predicate-specific template questions which were manually crafted by annotators on Amazon Mechanical Turk, and further denoised by the authors. Next, we refined a subset of these into high-quality question templates by posing the questions to a search engine, iterating this process until the responses were unambiguous. We also created template answers for these selected high-quality question templates, which we use in our subsequent experiments reported below. The specific prompts we used in our experiments are listed in Appendix A.
For each new question, given a predicate, we first sample a fact triple of this predicate type and create a valid question using the corresponding template. Next, to create the invalid question, we construct an invalid fact triple by sampling new subjects or objects found in facts of the same predicate type. We verify that this new triple does not exist as a factual predicate in the dataset; if such a triple exists, then our created fact is actually valid, so we discard it and repeat the sampling process until we have an invalid triple. Given this invalid triple, we fill the corresponding question template to generate the final invalid question. (Our curated question templates, along with labeled model responses, can be downloaded from https://github.com/amazon-science/invite-llm-hallucinations.)
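The sampling-and-verification loop above can be sketched as follows. This is an illustrative reimplementation under our own naming; the template shown is hypothetical, not one of the curated templates.

```python
import random

def make_invalid_triple(facts, predicate, rng=None, max_tries=1000):
    """Sample a fact of the given predicate type, then swap in an object
    drawn from other facts of the same type; retry until the resulting
    triple does NOT exist in the knowledge base, i.e. it is invalid."""
    rng = rng or random.Random(0)
    fact_set = set(facts)
    same_pred = [f for f in facts if f[1] == predicate]
    subj, pred, _ = rng.choice(same_pred)
    objects = [o for _, _, o in same_pred]
    for _ in range(max_tries):
        candidate = (subj, pred, rng.choice(objects))
        if candidate not in fact_set:  # a valid triple is discarded and resampled
            return candidate
    raise ValueError("could not construct an invalid triple")

# Hypothetical question template for the birthPlace predicate.
TEMPLATES = {"birthPlace": "Was {subj} born in {obj}?"}

facts = [
    ("Marie_Curie", "birthPlace", "Warsaw"),
    ("Alan_Turing", "birthPlace", "London"),
    ("Ada_Lovelace", "birthPlace", "London"),
]
s, p, o = make_invalid_triple(facts, "birthPlace")
print(TEMPLATES[p].format(subj=s, obj=o))
```

Restricting the swapped-in object to objects seen under the same predicate keeps the invalid question type-consistent (e.g. a birthplace question still asks about a place).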

Experiments
Testing model responses for hallucinations is a challenging task which needs a comprehensive fact verification system for automated evaluations. We instead use human verification to test for hallucinations in the generated responses. To evaluate the efficacy of our proposed framework, we first created a pilot test set of 100 questions, sampling uniformly from each category listed in Table 1.
Next, we generated responses to each of these questions using the models described below, leading to a total set of 600 generated responses. Finally, we manually examined these generations and labeled them for hallucinations, utilizing a search engine for additional validation of model responses. While manually labeling samples, we only treated responses which explicitly make an inaccurate statement as hallucinations, treating all others (including empty or degenerate responses) as non-hallucinations.

Models
We evaluate our test set on the models described below. We chose a diverse set of models with varied sizes and training datasets for a detailed evaluation of our test set. All open-source models were downloaded from Huggingface and evaluated on Nvidia A100 Tensor Core GPUs, while the proprietary GPT models were evaluated using OpenAI APIs (accessed on August 27, 2023). We ran inference without decoder sampling to further reduce the models' tendency for hallucinations, and stopped inference after 150 tokens.
GPT-Neo-2.7B GPT-Neo (Black et al., 2021) is a 2.7 billion parameter model developed by EleutherAI, following the architecture of GPT-3. It was trained on the Pile (Gao et al., 2020), a large-scale dataset curated by EleutherAI which spans diverse domains.
GPT-J-6B GPT-J-6B is a 6 billion parameter model trained using Mesh Transformer JAX (Wang, 2021), also on the Pile dataset from EleutherAI.
Open-LLaMA-7b-Open-Instruct This is an instruction-tuned, open-source release of the 7 billion parameter LLaMA model (Touvron et al., 2023), trained on the Open-Instruct-v1 dataset, which consists of 63,000 instruction training samples.
RedPajama-INCITE-7B-Instruct The RedPajama models were developed by a team of open-source developers from several organizations. The base model was trained on the RedPajama dataset, a 1T-token open-source reproduction of the LLaMA training dataset. Several model variants were available at the time of writing, and we used the 7B instruction-tuned version of the model in our evaluations.
GPT models We also ran evaluations on OpenAI's GPT3.5-Turbo (OpenAI, a) and GPT4 (OpenAI, b) models. GPT3.5-Turbo is a text-only model which supports understanding and generation of natural language and code, while GPT4 is OpenAI's most powerful LLM at the time of this writing. We used model snapshots from June 13, 2023 in both cases. These models are first pretrained on a large corpus and subsequently aligned using Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022). Additional details have not been disclosed by OpenAI.
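The deterministic decoding setting used for all models above can be illustrated with a toy greedy loop. This is a pure-Python sketch of what disabling decoder sampling amounts to; the scoring function is a stand-in, not a real language model.

```python
def greedy_decode(next_token_scores, prompt, max_new_tokens=150, eos="<eos>"):
    """Greedy (non-sampled) decoding: always pick the argmax token, so the
    same prompt always yields the same output. next_token_scores(tokens)
    returns a dict mapping candidate token -> score."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax instead of sampling
        if best == eos:
            break
        tokens.append(best)
    return tokens

# Stand-in "model": prefers to repeat the last token once, then stop.
def toy_scores(tokens):
    if tokens.count(tokens[-1]) < 2:
        return {tokens[-1]: 1.0, "<eos>": 0.5}
    return {"<eos>": 1.0}

print(greedy_decode(toy_scores, ["hello"]))  # → ['hello', 'hello']
```

Because decoding is deterministic, any hallucination observed in the evaluation is reproducible rather than an artifact of sampling noise, and the `max_new_tokens` cap mirrors our 150-token limit.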

Results
We present hallucination rates for the models described above in Table 2. As observed in the table, all the open-source models exhibit significant rates of hallucination triggered by invalid assumptions in the test questions. Examples of the generated hallucinations from each model are shown in Table 4. Among the open-source models, RedPajama-7B exhibited the lowest hallucination rate at 81%, while LLaMA-7B demonstrated the highest rate at 88%, closely followed by GPT-Neo-2.7B (83%) and GPT-J-6B (82%).
On the other hand, OpenAI's GPT3.5-Turbo and GPT4 models showed considerably lower (yet nontrivial) hallucination rates of 17% and 6% respectively. Since OpenAI does not release internal details of these models, it is unclear if they incorporate any form of external knowledge retrieval, which would provide an effective solution for handling questions with invalid assumptions. However, given the growing popularity of LLMs and the demand to deploy them in various low-latency regimes, smaller ungrounded open-source models such as those evaluated above are likely production candidates in many applications, and INVITE provides a valuable framework to evaluate them for hallucinations without relying on a fixed test set which runs the risk of getting ingested in future training data crawls. Further, even for GPT4 and GPT3.5-Turbo, identifying the 6% and 17% of examples which were hallucinated highlights existing gaps in these models and can provide valuable insights needed to make them safe for deployment.

GPT-J-6B
q: What was the name of Heather Sears's character in Gooische Vrouwen?
a: The name of Heather Sears's character in Gooische Vrouwen is Mieke.

Automated Evaluation Metrics
Since human labeling is slow and expensive, human labels may not always be feasible to obtain. As an alternative, we also computed various automated metrics commonly employed in evaluating natural language generations. To estimate these metrics, we compare model generations for the invalid questions against reference answers created using the category-wise answer templates listed in Appendix A.
As additional references, we also included two disengagement answers, "I don't know" and "I can't answer that", for comparison. Table 3 lists estimates for these metrics. In our experiments, we computed n-gram count based metrics such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004) against these references.
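To illustrate how such n-gram overlap metrics score a generation against a disengagement reference, here is a simplified unigram F1, a toy stand-in for BLEU-1/ROUGE-1 rather than the exact metrics reported in Table 3.

```python
from collections import Counter

def unigram_f1(generation: str, reference: str) -> float:
    """Unigram-overlap F1: harmonic mean of unigram precision and recall,
    a simplified cousin of BLEU-1 / ROUGE-1."""
    gen = Counter(generation.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "I don't know"
print(unigram_f1("I don't know the answer", reference))  # high overlap
print(unigram_f1("The answer is Paris", reference))      # no overlap
```

A disengaging response scores high against the disengagement reference, while a confidently hallucinated answer scores near zero, which is the behavior reference-based metrics rely on here.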

Limitations and Future Work
Any test set of limited size would not cover the entire possible space of invalid questions.Instead, we chose to sample a random subset of this space and obtain an empirical estimate of the model performance.
We define an invalid fact triple/relationship as one which does not exist in the knowledge base. This assumption may be violated in boundary cases where facts have not been entered into Wikipedia; however, we expect such cases to be marginal.
Generating invalid questions from a fixed set of templates may lead to limited diversity in questions. We used annotators from Amazon Mechanical Turk to address this, but our coverage was limited to a relatively small set of predicates, which we will expand in future work.
In the current version of the framework, we only create single-hop questions with two entities for simplicity. In future work, we will expand this by creating multi-hop questions which necessitate complex reasoning.
Finally, in future work we can also leverage this framework in LLM training by using such questions to teach the boundary of plausible knowledge to the model.

A Question Templates
We list questions used in our experiments in Table 5.

Table 1: Question categories in our test set, by source dataset.

Source Dataset | Question Categories
DBpedia | almaMater, associatedBand, author, award, birthPlace, city, commander, country, musicComposer, office, party, position, predecessor, publisher, spouse, successor, team, writer
TriviaQA | InvalidDate, FutureDate

Table 2: Hallucination rates on a test set of invalid questions (results sorted by model size).

Model | Hallucination Rate
GPT-Neo-2.7B | 83%
GPT-J-6B | 82%
Open-LLaMA-7B | 88%
RedPajama-INCITE-7B | 81%
GPT3.5-Turbo | 17%
GPT4 | 6%

Questions with Invalid Dates In addition to the questions extracted above, we create two more categories containing questions with invalid dates. Using regular expressions, we sample questions containing dates and years from the TriviaQA dataset's test set (Joshi et al., 2017) and create various distortions before adding these questions to our test set. Specifically, we distort full dates containing months by randomly selecting a new day beyond the valid days of the month (for example: March 32nd, 2023) and replacing the old date. Similarly, we distort years by randomly sampling a new year from [2025, 2100] and replacing the old year.
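The two date distortions can be sketched as follows. This is an illustrative reimplementation; the regular expressions and function names are our own, not those used to build the dataset.

```python
import calendar
import random
import re

MONTH_INDEX = {name: i for i, name in enumerate(calendar.month_name) if name}
MONTH_PATTERN = "|".join(MONTH_INDEX)

def _ordinal(n: int) -> str:
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}{ {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') }"

def distort_date(question: str, rng: random.Random) -> str:
    """InvalidDate: replace a 'March 21st, 2023'-style date with a day
    beyond the end of that month, e.g. 'March 32nd, 2023'."""
    def repl(m):
        month, year = m.group(1), int(m.group(3))
        last_day = calendar.monthrange(year, MONTH_INDEX[month])[1]
        bad_day = rng.randint(last_day + 1, last_day + 10)
        return f"{month} {_ordinal(bad_day)}, {year}"
    pattern = rf"({MONTH_PATTERN}) (\d{{1,2}})(?:st|nd|rd|th)?, (\d{{4}})"
    return re.sub(pattern, repl, question)

def distort_year(question: str, rng: random.Random) -> str:
    """FutureDate: replace a four-digit year with one sampled from [2025, 2100]."""
    return re.sub(r"\b(?:19|20)\d{2}\b",
                  lambda m: str(rng.randint(2025, 2100)), question, count=1)

rng = random.Random(7)
print(distort_date("What happened on March 21st, 2023?", rng))
print(distort_year("Who won the World Cup in 1998?", rng))
```

Sampling the replacement day just past the true month length keeps the distorted date superficially plausible while being verifiably invalid.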

Table 3: Automated metrics between generated responses and references.
GPT-Neo-2.7B
q: When did Michael Stephenson write Northern Touch?
a: Northern Touch is a novel by Michael Stephenson, published in 2007. It is the first of a trilogy, and the second book in the series is The Last of the Breed. The novel is set in the fictional town of Ballymote, in County Mayo, Ireland. The town is a fictionalised version of the town of Ballymote, County Mayo, in Ireland. The novel is set in the late 1950s, and is the first of a trilogy, and the second book in the series is The Last of the Breed.

q: When did Terry Sabo succeed Fernand Roberge?
a: Terry Sabo was elected to the Maine House of Representatives in 1988 and served until 1994. In 1992, he was elected to the Maine Senate and served until 1996. In 1994, he was elected to the U.S. House of Representatives and served until 2002. Fernand Roberge served as the U.S. Representative for Maine's 1st congressional district from 1973 to 1987. Terry Sabo succeeded him in the U.S. House of Representatives in 1987.

Table 4: Examples of hallucinated model generations on test questions with invalid assumptions.