Diving Deep into Modes of Fact Hallucinations in Dialogue Systems

Knowledge Graph (KG) grounded conversational systems often build on large pre-trained models and frequently suffer from fact hallucination: entities with no reference in the knowledge source or conversation history are introduced into responses, hindering the flow of the conversation. Existing work attempts to overcome this issue by tweaking the training procedure or using multi-step refining methods, but minimal effort has gone into constructing an entity-level hallucination detection system, which would provide fine-grained signals for controlling fallacious content while generating responses. As a first step to address this issue, we dive deep to identify various modes of hallucination in KG-grounded chatbots through human feedback analysis. Second, we propose a series of perturbation strategies to create a synthetic dataset named FADE (FActual Dialogue Hallucination DEtection Dataset). Finally, we conduct comprehensive data analyses and create multiple baseline models for hallucination detection, comparing them against human-verified data and established benchmarks.


Introduction
Knowledge-grounded conversational models often build on large pre-trained models (Radford et al., 2019; Brown et al., 2020). These models are notorious for producing responses that do not comply with the provided knowledge; this phenomenon is known as hallucination (Dziri et al., 2022b; Rashkin et al., 2021b). Faithfulness to the supplementary knowledge is one of the prime design factors in knowledge-grounded chatbots: if a response is unfaithful to the given knowledge, it becomes uninformative and risks jeopardizing the flow of the conversation. Despite retaining strong linguistic abilities, large language models (LMs) inadequately comprehend and present facts during conversations. LMs are trained to emulate the distributional properties of their data, which intensifies their hallucinatory attributes at test time.
Figure 1: Hallucinations manifested in responses generated by GPT2 (Radford et al., 2019) trained on KG triples can be more nuanced.
On the one hand, many prior works (Wiseman et al., 2017; Parikh et al., 2020; Tuan et al., 2019) have suggested that training these models on external data to ensure faithfulness may lead to a source-reference divergence problem, where the reference contains additional factual information. To address this problem holistically, Dziri et al. proposed a two-step generate-then-refine approach, augmenting conventional dialogue generation with a refinement stage that enables the dialogue system to correct potential hallucinations by querying the KG. That work also employs a token-level hallucination classifier trained on a synthetic dataset constructed using two perturbation strategies. Though this method has clear benefits, the perturbation strategies it proposes may fail to capture some of the subtler attributes of a factual generative model. As illustrated in Figure 1, neural models can inject hallucinated entities into responses that are present in the k-hop KG and are deceptively similar to what is expected. Moreover, if we cannot detect these elusive hallucinations beforehand, they cause a cascading effect and amplify hallucinations in subsequent turns (See and Manning, 2021).
On the other hand, relying on human annotations is challenging due to error-prone collection protocols and annotators' failure to complete the tasks with care (Smith et al., 2022). Prior research (Dziri et al., 2022c) shows that knowledge-grounded conversational benchmarks contain hallucinations promoted by design frameworks that encourage informativeness over faithfulness. As studied by Dziri et al., when annotators are asked to identify hallucination in a response, there is a high chance of error due to lack of incentive, personal bias, or poor attention to the provided knowledge.
See and Manning have studied various shortcomings of a neural model deployed in real time. In this work, building on some of their findings, such as repetitive and unclear utterances promoting hallucination, we extend the previously defined modes of hallucination (Maynez et al., 2020; Dziri et al., 2021a). Our contributions are threefold:
• We extend fact hallucination in KG-grounded dialogue systems into eight categories. To understand the degree to which our defined classes exist in real-life data, we conduct a systematic human evaluation of data generated by a state-of-the-art neural generator.
• Since human annotation is expensive and often inaccurate, we design a series of novel perturbation strategies to simulate the defined modes of fact hallucination and build a set of synthetic datasets collectively named FADE (FActual Dialogue Hallucination DEtection Dataset).
• We create multiple pre-trained-model-based baselines and compare their performance on several component and mixed datasets. To assess our dataset's generalization capability, we perform zero-shot inference on the BEGIN (Dziri et al., 2021b) and FaithDial (Dziri et al., 2022a) datasets, which encompass all categories of hallucinated responses.
2 Different Modes of Hallucination in KG-grounded Dialogue Systems

Background
We focus on the task of detecting hallucinated spans in dialogues that are factually grounded on factoids derived from multi-relational graphs G = (V, E, R), termed Knowledge Graphs (KGs). Our study extends the work of Dziri et al. (2021a), which explores two broad circumstances, extrinsic and intrinsic to the provided KG, under which LMs are likely to exhibit unfaithful behavior. Though this categorization is beneficial for detecting hallucinations, these categories can be further subdivided into subcategories, as described in §2.3.

Base Dataset
We use OpenDialKG (Moon et al., 2019), a crowd-sourced English dialogue dataset in which two workers are paired to chat about a particular topic (mainly movies, music, sports, and books). We use this dataset to train a GPT2-based model for generating data for the human feedback analysis and for creating the perturbed datasets. More details about the dataset can be found in §C.

Definitions
We define below several categories of fact hallucination; comprehensive illustrations of each type are provided in Figure 2, and detailed descriptions of each definition are included in §A.
(a) (Extrinsic-Soft). An extrinsic-soft hallucination corresponds to an utterance that brings a new span of text which is similar to the expected span but does not correspond to a valid triple in G_c^k.
(b) (Extrinsic-Hard). An extrinsic-hard hallucination corresponds to an utterance that brings a new span of text which is different from the expected span and does not correspond to a valid triple in G_c^k.
(c) (Extrinsic-Grouped). An extrinsic-grouped hallucination corresponds to an utterance that brings a new span of text which is different from the expected span but is of a specific predefined type and does not correspond to a valid triple in G_c^k.
(d) (Intrinsic-Soft). An intrinsic-soft hallucination corresponds to an utterance that misuses any triple in G_c^k such that there is no direct path between the entities, but they are similar to each other.
(e) (Intrinsic-Hard). An intrinsic-hard hallucination corresponds to an utterance that misuses any triple in G_c^k such that there is no direct path between the entities, and they are not related in any form.
(f) (Intrinsic-Repetitive). An intrinsic-repetitive hallucination corresponds to an utterance that misuses either [SBJ] or [OBJ] in G_c^k such that there is no direct path between the entities, but the entity has previously occurred in the conversational history.
(g) (History Corrupted-Intrinsic/Extrinsic). A history-corrupted (intrinsic/extrinsic) hallucination corresponds to an utterance that is subject to intrinsic or extrinsic hallucination influenced by hallucinated entities in the conversational history.

Human Feedback Analysis
To study the extent to which the previously described modes of hallucination exist in a real-world system, we performed a human feedback analysis on responses generated by a GPT2-based model fine-tuned on OpenDialKG, as described by Dziri et al.. We sampled 200 responses from each of four decoding strategies: greedy decoding, beam search, and nucleus sampling with p = 0.9 and with p = 0.5. For each dialogue instance, we crowd-sourced human judgment by soliciting evaluations from 2 different annotators (with a high approval rating) on Amazon Mechanical Turk (AMT) (details in §B). A computer science graduate student additionally verified the Human Intelligence Tasks (HITs). For examples where hallucination was present, we asked the workers to identify its type (examples of the different types were shown in the annotation interface). From Table 1 we made these observations:
• Extrinsic-soft hallucination is the dominant form of hallucination. This bolsters our prior observation that LMs generate entities similar to the golden entity.
• Comparatively fewer hallucinations were seen in responses generated with beam search decoding, though the percentage of extrinsic-hard hallucination was higher than with greedy decoding.
• Intrinsic-hard hallucination appears the least among all types. This suggests an LM will always try to learn something from the given KG triples; generating something dissimilar has a very low probability.

Dataset Creation
FADE is a collection of datasets: component datasets created using several perturbation strategies, and a set of mixed datasets constructed from the component datasets.

Perturbation Strategies
Extrinsic Hallucination All the entities present in OpenDialKG undergo an indexing process. First, using Spacy, we determine the named-entity type for each entity and create BM25 indexes for each entity type. Each KG triple corresponding to an entity is represented in the format "[SBJ] [PRE] [OBJ]" and denoted t_i. For an entity e_i, we create a document d_i from its triples {t_1, ..., t_n}, where n is the number of KG triples for that entity. We then index d_i and e_i in the index corresponding to the entity type. During the perturbation process, we retrieve all the KG triples for the entity we want to perturb and form 3 queries for each triple by permuting its [SBJ], [PRE], and [OBJ] terms. Then, based on the type of extrinsic hallucination, we query the indices to get the document scores as scores = average({BM25(q_i, d_j)}_{i ∈ (s, r, o), j ∈ (0, n)}); the selection criteria for the perturbed entities are provided in Table 2.
The groups for extrinsic-grouped hallucination are listed in Table 10. During the selection process, we iteratively check whether the perturbed entity exists in the conversation history, matches the actual entity, or has appeared in the 1-hop sub-graph of the original entity. If an occurrence is found, we proceed to the next-best entity.
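As a rough illustration, the BM25 scoring behind the extrinsic perturbation can be sketched in pure Python. This is a stand-in for the Solr indexes used in practice; the entity documents and query below are hypothetical, and the k1/b defaults follow the values tuned in §D.1.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.6, b=0.9):
    """Score each document (a list of tokens) against the query tokens
    using Okapi BM25. k1 and b default to the values found in Appendix D.1."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency and idf for each distinct query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf[t] * num / den
        scores.append(s)
    return scores

# Hypothetical entity documents: each is the concatenation of the
# "[SBJ] [PRE] [OBJ]" strings for one candidate entity.
docs = [
    "inception directed_by christopher_nolan".split(),
    "jaws directed_by steven_spielberg".split(),
]
query = "inception directed_by".split()
scores = bm25_scores(query, docs)
```

In the actual pipeline, the per-query scores would then be averaged over the three permuted queries per triple before applying the selection criteria of Table 2.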
Intrinsic Hallucination Here, we dynamically create a BM25 index and index all the KG triples in the 1-hop sub-graph of the original entity. A KG triple is represented in the same fashion as in extrinsic hallucination: "[SBJ] [PRE] [OBJ]". The goal is to select entities that are similar or dissimilar to the original entity and present in the 1-hop graph. To achieve this, we follow a hybrid triple-retrieval approach to score each triple associated with the original entity. First, we use the final hidden layer of a pre-trained GPT2 to obtain initial embeddings for each node in G_c^k (for details, see §D.3). A query is formed using Equation 1, and each triple in G_c^k is scored with the similarity scoring system described in Equation 3.
Here, ε is a free term parameter (§D.2), p(q_i) is the unigram probability of the query term, and v_{q_i} is the embedding of each query term (the query terms are the [SBJ], [PRE], and [OBJ] of the original entity).
In Equation 2, n_i represents a triple embedding in G_c^k, and q(r) represents the rarity of the relation term in the subgraph, so that high-occurrence relations are penalized; the remaining terms are analogous to Equation 1.
Next, we query the previously created BM25 index with a simple query formed from the original triple, "[SBJ] [PRE] [OBJ]", and obtain a score for each triple t. Finally, we compute the final scores using Equation 4.
We select the perturbed entities based on the scores and the selection criteria defined in Table 3. As with extrinsic hallucinations, we iteratively move past the best-scored entity until we find one that neither matches the original entity nor appears in the history.
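Since Equations 1-4 are not reproduced here, the following is only a schematic sketch of the hybrid-scoring idea: cosine similarity between embedding vectors is blended with a normalised BM25 score through a weight β. The vectors, BM25 values, and the min-max normalisation are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def hybrid_scores(query_vec, triple_vecs, bm25, beta=0.5):
    """Blend cosine similarity between a query embedding and each triple
    embedding with a min-max-normalised BM25 score. beta is a placeholder
    for the weight tuned in Appendix D.2."""
    q = query_vec / np.linalg.norm(query_vec)
    T = triple_vecs / np.linalg.norm(triple_vecs, axis=1, keepdims=True)
    sim = T @ q                                    # cosine similarities
    bm25 = np.asarray(bm25, dtype=float)
    rng = bm25.max() - bm25.min()
    bm25_norm = (bm25 - bm25.min()) / rng if rng > 0 else np.zeros_like(bm25)
    return beta * sim + (1 - beta) * bm25_norm

# Toy 2-d embeddings for a query entity and two candidate triples.
q = np.array([1.0, 0.0])
T = np.array([[1.0, 0.1],   # triple close to the query entity
              [0.0, 1.0]])  # unrelated triple
final = hybrid_scores(q, T, bm25=[2.0, 1.0], beta=0.5)
```

Under an intrinsic-soft strategy one would pick the highest-scoring triple; under intrinsic-hard, the lowest.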
History Corrupted Hallucination The conversational history is corrupted using the intrinsic or extrinsic corruption strategy. We select the last k turns of the conversation and randomly perturb their entities, ensuring that at least 50% of the previous k turns are corrupted.
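A minimal sketch of this history-corruption step, assuming a hypothetical perturb function that applies the intrinsic or extrinsic strategy to a single turn (returning None when no entity in the turn has a valid KG path):

```python
import random

def corrupt_history(turns, k, perturb, rng=None):
    """Randomly perturb entities in the last k turns, re-sampling until
    at least 50% of those turns are corrupted."""
    rng = rng or random.Random(0)
    history, last_k = turns[:-k], list(turns[-k:])
    corrupted_idx = set()
    while len(corrupted_idx) < (k + 1) // 2:   # ensure >= 50% corrupted
        i = rng.randrange(k)
        new_turn = perturb(last_k[i])
        if new_turn is not None:
            last_k[i] = new_turn
            corrupted_idx.add(i)
    return history + last_k, corrupted_idx

turns = ["A: hi", "B: I love Inception", "A: Nolan directed it", "B: yes"]
fixed, idx = corrupt_history(turns, k=2, perturb=lambda t: t + " [CORRUPT]")
```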

Dataset Analysis
Below we provide data statistics and characterize the composition and properties of the datasets generated using our proposed perturbation strategies.

Data Statistics
Tables 4 and 5 show the statistics of the datasets created using the different perturbation strategies. The base dataset contains 77,430 data points. However, the number of perturbed turns in each of these datasets is quite low in comparison, because not every entity in an utterance has a valid KG path.
For extrinsic hallucination, ∼12,000 to ∼23,000 utterances were perturbed, and ∼550 to ∼11,300 utterances have multiple perturbations. The number of perturbed data points for intrinsic hallucination is lower (∼9,000 to ∼18,000), and the number of utterances with multiple perturbations is negligible due to the many checks the perturbed entities go through (for example, whether a KG path is present, whether the entity has already occurred, etc.). To train and evaluate models, we vary the size of the train split (a sequential split) from 10% to 30% with a step of 2.5%, keeping in mind to avoid overfitting. The remaining data is split into equal halves for validation and testing.

Parsing Features
In Figure 3 we show the top-10 Named Entity Recognition (NER) tags, as identified by the Spacy library, in extrinsic hallucinations. For extrinsic-soft hallucination, most NER tags are of type PERSON. This corresponds to the fact that the original entities in the base dataset are primarily related to movies, books, and music: in extrinsic-soft hallucination, the associated PERSON name is changed to a closely affiliated person, or a movie name is changed to its director's name. In contrast, the distribution of NER tags is uniform for extrinsic-hard hallucination. Figures 4 and 5 show the top-10 relations of the perturbed entity with the original entity in intrinsic-soft and intrinsic-hard hallucinations, along with the corresponding values in their counterparts. In intrinsic-soft hallucination, more relevant relations are selected, such as "release year", "starred actors", and "written by"; in intrinsic-hard hallucination, by contrast, less related relations are selected.

Mixing Datasets
Since all kinds of hallucinations are expected to occur in actual data, we mix the previously constructed datasets in specific proportions to create a more challenging dataset. Table 11 shows the mixing ratios for the four types of mixed datasets:
Observed: We mimic the observed data shown in §2.4, taking the average of the percentages across all decoding strategies.
Balanced: The goal is a dataset balanced between hallucinated and non-hallucinated turns, with each type of hallucination also balanced.
Extrinsic+: We increase the percentages of extrinsic-soft, hard, and grouped by factors of 2, 1.5, and 1.5, respectively.
Intrinsic+: We increase the percentages of intrinsic-soft, hard, and repetitive by a factor of 1.5.
More details are in §D.4.
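The mixing itself can be sketched as proportional sampling from the component datasets. The ratio values and component names below are illustrative only, not the ones in Table 11.

```python
import random

def mix_dataset(component_sets, ratios, n_total, seed=0):
    """Draw samples from component datasets according to mixing ratios.
    component_sets maps a hallucination type (or 'clean') to a list of
    examples; ratios maps the same keys to proportions summing to 1."""
    rng = random.Random(seed)
    mixed = []
    for name, ratio in ratios.items():
        k = round(n_total * ratio)
        mixed.extend(rng.sample(component_sets[name], k))
    rng.shuffle(mixed)
    return mixed

# Toy components tagged with their origin.
components = {
    "extrinsic_soft": [("ex_soft", i) for i in range(100)],
    "intrinsic_soft": [("in_soft", i) for i in range(100)],
    "clean": [("clean", i) for i in range(100)],
}
ratios = {"extrinsic_soft": 0.3, "intrinsic_soft": 0.2, "clean": 0.5}
mixed = mix_dataset(components, ratios, n_total=50)
```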

Human Verification
To verify that our proposed perturbation strategies inject hallucinations into the original data, we randomly sample 150 examples from each of the mixed datasets' test splits. These samples were then randomly ordered to form a consolidated set of 600 data points, annotated by at least three AMT workers under the same setting described in §2.4. Additionally, the graduate student verified whether the hallucinations adhere to the perturbation norms. Krippendorff's alpha was 0.88 among workers and 0.76 between workers and the perturbed data (on average), indicating very high agreement. Since our perturbation strategies are purely deterministic, we kept a large-scale human verification of the automatically annotated data outside the scope of this work. We create a human-verified dataset of 500 samples: 300 taken from this set and 200 from the human feedback study (§2.4).

Task
We aim to identify utterances that contain hallucinations and to locate the entities of concern. We create two tasks:
1. Utterance classification: Given the dialogue history D, knowledge triples K_n, and the current utterance x_{n+1}, classify whether x_{n+1} is hallucinated.
2. Token classification: Given D, K_n, and x_{n+1}, perform sequence labelling on x_{n+1} to identify the hallucinated spans.

Baseline Models
As an initial effort toward tackling the suggested hallucination detection task, we create several baseline detection models based on pre-trained transformer models, including BERT, XLNet, and RoBERTa.These transformer-based models represent the state-of-the-art and can potentially better leverage context or embedded world knowledge to detect self-contradictory or anti-commonsense content.
For training the utterance classifier, given D, K_n, and x_{n+1}, we fine-tune a pre-trained model M to predict a binary hallucination label y for x_{n+1}. Here, D and K_n are treated as sequence A with token type ids 0, and x_{n+1} as sequence B with token type ids 1. From the last hidden states H ∈ R^{l×h} (where h is the hidden size and l the sequence length), we obtain the representation w ∈ R^h by max pooling (i.e., w = max_pool(H)). We then pass w through an MLP layer with a tanh activation to get the binary label y ∈ {0, 1}. During training, we fine-tune the model using a cross-entropy objective between the predicted and actual labels.
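The pooling-and-MLP head described above can be sketched as follows. The pre-trained encoder itself is omitted; its last hidden states are mocked with a random tensor, and the 2-layer shape of the MLP follows the best configuration reported later.

```python
import torch
import torch.nn as nn

class UtteranceHallucinationHead(nn.Module):
    """Max-pool the encoder's last hidden states, then a 2-layer MLP
    with tanh to produce binary hallucination logits."""
    def __init__(self, hidden_size, n_labels=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, n_labels),
        )

    def forward(self, H):                  # H: (batch, seq_len, hidden)
        w, _ = H.max(dim=1)                # max pooling over tokens
        return self.mlp(w)                 # (batch, n_labels)

H = torch.randn(4, 128, 768)               # mock last hidden states
logits = UtteranceHallucinationHead(768)(H)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
```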
Similarly, for the token-level tagger, we fine-tune a pre-trained model M_s. We first encode D, K_n, and x_{n+1} using M_s to get the last hidden states H ∈ R^{l×h} (where h is the hidden size and l the sequence length). Instead of performing a binary classification of each token, we adopt a BILOU encoding scheme: the hidden states are passed through an MLP layer with a tanh activation to obtain the 5-way label y ∈ {B, I, L, O, U}.
During training, we again fine-tune the model using a cross-entropy objective between the predicted and actual labels.
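A corresponding sketch of the BILOU token-tagger head, again with the encoder omitted and its states mocked:

```python
import torch
import torch.nn as nn

BILOU = ["B", "I", "L", "O", "U"]

class HallucinationSpanTagger(nn.Module):
    """Per-token 5-way BILOU classification over the encoder's last
    hidden states."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, len(BILOU)),
        )

    def forward(self, H):                  # H: (batch, seq_len, hidden)
        return self.mlp(H)                 # (batch, seq_len, 5)

H = torch.randn(2, 32, 768)                 # mock last hidden states
logits = HallucinationSpanTagger(768)(H)
tags = [[BILOU[i] for i in row] for row in logits.argmax(-1).tolist()]
```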

Experimental Setup
Baseline configurations We experiment with a variety of pre-trained models via Hugging Face Transformers, including BERT-base-uncased (110M), RoBERTa-base (125M), and XLNet-base-cased (110M). Though large or medium versions of these models would produce better results, we refrain from using them, as scaling large models in production is costly. More details about training parameters can be found in §E. We also experimented with the model architecture as follows: (i) varying the length of the history; (ii) max vs. mean pooling; (iii) whether to concatenate the hidden states corresponding to K_n with those corresponding to x_{n+1} before passing them through the MLP layer; (iv) using a CRF layer instead of an MLP for predicting labels in the sequence tagger. The best configuration uses 4 turns of conversational history, max pooling, and a 2-layer MLP, and does not concatenate the hidden states of K_n with those of x_{n+1}.
Evaluation metrics We evaluate the baselines with standard classification metrics: precision, recall, and F1 for the hallucination sequence tagger; and accuracy, precision, recall, F1, and AUC (Area Under the ROC Curve) for the utterance-level hallucination classifier. We also use the G-Mean metric (Espíndola and Ebecken, 2005), which measures the geometric mean of sensitivity and specificity, and the Brier Skill Score (BSS) (Center, 2005), which computes the mean squared error between the reference distribution and the hypothesis probabilities.
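The two less common metrics can be sketched directly from their definitions. Treating the BSS reference forecast as the positive-class base rate is an assumption of this sketch.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

def brier_skill_score(y_true, y_prob):
    """1 - BS/BS_ref, where the reference forecast always predicts the
    base rate of the positive class (an assumption of this sketch)."""
    n = len(y_true)
    bs = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n
    base = sum(y_true) / n
    bs_ref = sum((base - t) ** 2 for t in y_true) / n
    return 1.0 - bs / bs_ref

y_true = [1, 0, 1, 0]
gm = g_mean(y_true, [1, 0, 0, 0])           # sensitivity 0.5, specificity 1.0
bss = brier_skill_score(y_true, [0.9, 0.1, 0.8, 0.2])
```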

Results and Discussion
Baseline performance Tables 6 and 7 show the baseline performance on the component and mixed datasets. In both settings, the utterance-level hallucination classifier performs better than the token tagger in terms of F1, as can be inferred from §3.4. Using the existing benchmark and baseline models, we also perform zero-shot inference on the human-verified data. From Table 8, it is clear that models fine-tuned on existing benchmark data cannot understand fact hallucination, especially when entities are misplaced. In contrast, models trained on our datasets achieve F1 scores over 90% and outperform the current baseline by 10.16% and 17.5% on the two tasks, while using a pre-trained model with fewer parameters. This suggests that identifying abrupt fact hallucination is more challenging than the types of hallucination (such as presenting more information than expected) more commonly exhibited in the benchmark datasets.
Generalisability We perform zero-shot inference on the BEGIN and FaithDial test splits. To make a fair comparison with the benchmark models, we further fine-tune a roberta-large model on our datasets. Table 9 shows that the F1 scores of our best models underperform the best-performing baseline by 6% on the BEGIN dataset and by 10.17% on the FaithDial dataset. Even though the performance is lower, the benchmark datasets contain hallucinations that are fundamentally very different from fact hallucinations. We also notice that models trained on intrinsic hallucination perform best, because the hallucinatory responses in the benchmark datasets do not deviate much from the evidence. To estimate how much training data is optimal for generalisability, we ran inference on the benchmark datasets using models fine-tuned on 10% to 30% (with a step of 2.5%) of the data in the train split. As shown in Figure 7, approximately 25% is found to be optimal.
Model Predictions We visualize predictions on different datasets in Figure 6. Our models easily identify the hallucinated entities: in Figure 6a, "The Departed" is a movie in which "Mark Wahlberg" has acted, but it is not related to the movie discussed in the context, i.e., "The Italian Job". Similarly, predictions on the FaithDial dataset (Figure 6c) show that our models produce accurate predictions when the response generates something unexpected but the hallucination has similarities with the evidence. Our model sometimes fails when the history is convoluted (Figure 6b).

Related Work
Hallucination in Dialogue Systems Hallucination in knowledge-grounded dialogue generation systems is an emerging area of research (Roller et al., 2021; Mielke et al., 2020; Shuster et al., 2021; Rashkin et al., 2021b; Dziri et al., 2021a). Prior work addressed this issue by conditioning generation on control tokens (Rashkin et al., 2021b), by training a token-level hallucination critic to identify troublesome entities and rectify them (Dziri et al., 2021a), or by augmenting a generative model with a knowledge-retrieval mechanism (Shuster et al., 2021). Though beneficial, these models are trained on noisy training data (Dziri et al., 2022b), which can further amplify hallucinations. Closest to our work, Dziri et al. (2021a) created a hallucination critic using extrinsic-intrinsic corruption strategies. In contrast, we create more fine-grained corruption strategies so that the hallucinated data mimics the attributes of a neural chat module.
Hallucination Evaluation Several benchmarks have recently been introduced, such as BEGIN (Dziri et al., 2021b), DialFact (Gupta et al., 2022), FaithDial (Dziri et al., 2022a), and the Attributable to Identified Sources (AIS) framework (Rashkin et al., 2021a). Though these can serve as decent benchmarking systems, their performance in detecting entity-level hallucination is unknown. In this work, we further contribute to this problem by proposing an entity-level hallucination detector trained on data created with various fine-grained perturbation strategies.

Conclusion
In this work, we have analyzed the modes of entity-level fact hallucination, an open problem in KG-grounded dialogue systems. Through a human feedback analysis, we demonstrate that KG-grounded neural generators manifest more nuanced hallucinations than those studied in straightforward approaches. We propose fine-grained perturbation strategies that mimic real-world observations and use them to create a series of datasets collectively known as FADE. Our entity-level hallucination detection model can predict hallucinated entities with an F1 score of 75.59% and classify whether an utterance is hallucinated with an F1 score of 90.75%. Our models generalize well when making zero-shot predictions on benchmarks like BEGIN and FaithDial, indicating the robustness of our perturbation strategies. This work can be extended by devising more sophisticated perturbation mechanisms that simulate other types of hallucinations.

Limitations
The major limitations of this work are as follows:
• The token-level hallucination classifier and utterance-level hallucination classifier can produce contradictory results; however, this happens for a small percentage of the data.
• Models trained on the extrinsic datasets do not generalize well to the benchmark datasets, as the benchmark datasets contain hallucinations mostly related to the provided evidence.

A Definition Details

Table 10: Defined groups for extrinsic-grouped hallucination
1. A person, organization, political party, or religious group can be related to each other: "PERSON", "ORG", "NORP"
2. Locations, buildings, airports, infrastructure elements, countries, cities, and states can be interrelated: "LOC", "GPE", "FAC"
3. A product, work of art, or law can be interrelated: "PRODUCT", "WORK_OF_ART", "LAW"

(a) (Extrinsic-Soft). An extrinsic-soft hallucination corresponds to an utterance that brings a new span of text which is similar to the expected span but does not correspond to a valid triple in G_c^k.
The sample contains an extrinsic-soft hallucination, as the entity in the response, "Steven Spielberg", is similar to "Christopher Nolan" and is not supported within the 1-hop sub-graph.
(b) (Extrinsic-Hard). An extrinsic-hard hallucination corresponds to an utterance that brings a new span of text which is different from the expected span and does not correspond to a valid triple in G_c^k.
An extrinsic-hard hallucination occurs when the injected knowledge is dissimilar to the expected entity and is not supported within the 1-hop sub-graph. It is easier to detect extrinsic-hard than extrinsic-soft hallucination, as the entities are fundamentally different from those present in the 1-hop sub-graph. However, the entity type is retained: an entity of type "person" will be replaced by an entity of the same type. Figure 8 shows an example of extrinsic-hard hallucination, where the golden entity "Christopher Nolan" is replaced by a different entity, "Joe Biden", but the entity type is retained.
(c) (Extrinsic-Grouped). An extrinsic-grouped hallucination corresponds to an utterance that brings a new span of text which is different from the expected span but is of a specific predefined type and does not correspond to a valid triple in G_c^k. Like an extrinsic-hard hallucination, an extrinsic-grouped hallucination introduces an entity that is functionally different from the original entity and not supported by the 1-hop sub-graph. The only difference is that the corrupted entity is not of the same type; instead, it is replaced by an entity of a similar type, as defined in Table 10. For example, Figure 8 shows "Christopher Nolan", of type "person", replaced by "Warner Bros.", of type "organization"; the types "person" and "organization" are placed in the same group.
(d) (Intrinsic-Soft). An intrinsic-soft hallucination corresponds to an utterance that misuses any triple in G_c^k such that there is no direct path between the entities, but they are similar to each other.
Intrinsic hallucinations occur when the KG triples are misused. In intrinsic-soft hallucination in particular, an entity is selected from G_c^k that is closely related to the original entity. For example, in Figure 9, "Christopher Nolan" is replaced with "The Dark Knight Rises", which is retrieved from the 1-hop sub-graph and has a close relation with the original entity "Christopher Nolan".
(e) (Intrinsic-Hard). An intrinsic-hard hallucination corresponds to an utterance that misuses any triple in G_c^k such that there is no direct path between the entities, and they are not related in any form.
Like intrinsic-soft hallucination, it also misuses the information in the KG triples; however, the similarity of the corrupted entity with the original entity is relatively small. For example, in Figure 9, "Christopher Nolan" is replaced with "United States of America". Although the corrupted entity is drawn from G_c^k, it is very different from the original entity.
(f) (Intrinsic-Repetitive). An intrinsic-repetitive hallucination corresponds to an utterance that misuses either [SBJ] or [OBJ] in G_c^k such that there is no direct path between the entities, but the entity has previously occurred in the conversational history.
An entity from the conversational history is often repeated in the current utterance, which corresponds to intrinsic-repetitive hallucination. Here, an entity from the history that also occurs in G_c^k with high relatedness is swapped with the original entity. Figure 9 shows "Batman Begins", which is supported by G_c^k, being replaced with "Christopher Nolan".
(g) (History Corrupted-Intrinsic/Extrinsic). Sometimes conversational agents are driven into a perplexed state, and we can witness hallucinations in most turns; this hallucinated history can trigger hallucination in the current utterance. This phenomenon can be seen in both extrinsic and intrinsic forms of hallucination. Figure 10 depicts extrinsic/intrinsic hallucination occurring in the history: "The Dark Knight" is changed to "The Dark Knight Rises" for intrinsic hallucination, and to "Spider-Man" for extrinsic hallucination. Hallucination in the current utterance then happens as described in the previous sections.

B AMT Instructions
We present screenshots of the annotation interface in Figures 11, 12, and 13. Workers were paid an average of $7-8 per hour across all tasks. We acknowledge that this annotation process has a high learning curve; even workers with high approval rates made errors in the initial rounds of annotation. A computer science graduate student manually verified randomly selected samples and provided feedback to the workers, especially when they selected the same answers for ten consecutive HITs. After sending feedback three times, all spammed HITs were discarded.

C OpenDialKG
We use OpenDialKG (Moon et al., 2019), a crowd-sourced English dialogue dataset in which two workers are paired to chat about a particular topic. The first speaker is asked to start the conversation about a given entity. The second speaker is asked to write a factual response based on facts extracted from an existing KG, Freebase (Bast et al., 2014). The facts represent paths in the KG that are either 1-hop or 2-hop from the initial entity. Once the second speaker responds, the first speaker continues discussing the topic engagingly, and new multi-hop facts from the KG are shown to the second speaker. The dialogue can thus be considered a traversal of multiple paths in the KG. However, not all utterances within a conversation are grounded in facts from the KG: the second speaker can decide not to select a KG path and instead produce a "chit-chat" response. Overall, the dataset covers four domains (movie, music, sport, and book), and each second-speaker utterance is annotated with paths from the KG. The KG corresponds to an extensive subgraph extracted from Freebase with ∼1.2M triples (subject, predicate, object), ∼101k distinct entities, and 1,357 distinct relations. We use 77,430 data points from the dataset to construct FADE.

D Perturbation Hyper-parameters

D.1 Search Index Details
We use Solr in the case of extrinsic hallucination, with the BM25 index defined by the class solr.BM25SimilarityFactory. We manually labeled 50 data points (for the entity type PERSON) for tuning the index through grid search. Grid-search conditions were as follows: b was varied from 0.3 to 0.9 with a step of 0.1, and k1 was varied from 0.8 to 2.0 with a step of 0.2. The grid search found an optimum MAP score of 0.789, with b = 0.9 and k1 = 1.6. For the dynamic indexes that were created in the case of intrinsic hallucination, we
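The grid search over BM25's b and k1 can be sketched as below. This is a self-contained illustration (the paper tunes these parameters inside Solr via solr.BM25SimilarityFactory, not in Python); the toy corpus, queries, and relevance judgments are stand-ins for the 50 manually labeled PERSON data points.

```python
import math
from itertools import product

def bm25_score(query_terms, doc, corpus, k1, b):
    """Plain BM25, the same formula Solr's BM25SimilarityFactory implements."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def average_precision(ranked, relevant):
    hits, ap = 0, 0.0
    for i, doc_id in enumerate(ranked, 1):
        if doc_id in relevant:
            hits += 1
            ap += hits / i
    return ap / max(len(relevant), 1)

def grid_search(queries, corpus, relevance):
    """Search b in 0.3..0.9 (step 0.1) and k1 in 0.8..2.0 (step 0.2),
    picking the parameters that maximize mean average precision (MAP)."""
    best_params, best_map = None, -1.0
    for b, k1 in product([0.3 + 0.1 * i for i in range(7)],
                         [0.8 + 0.2 * i for i in range(7)]):
        aps = []
        for q, rel in zip(queries, relevance):
            ranked = sorted(range(len(corpus)),
                            key=lambda i: bm25_score(q, corpus[i], corpus, k1, b),
                            reverse=True)
            aps.append(average_precision(ranked, rel))
        m = sum(aps) / len(aps)
        if m > best_map:
            best_params, best_map = (b, k1), m
    return best_params, best_map
```

With the labeled data plugged in, the returned parameters correspond to the reported optimum (b = 0.9, k1 = 1.6 at MAP 0.789).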

D.2 Free parameter & β optimization
We use a free term-weight parameter (ε) in intrinsic hallucination to represent the queries and nodes. As for extrinsic hallucination, we manually annotated 50 data points and ran a grid search for ε ∈ {10^-i, 2 × 10^-i : i ∈ {1, ..., 5}}, and found ε = 2 × 10^-4 to be the optimum value. We used the same technique to optimize β, with a search space ranging from 0.1 to 0.7 with a step of 0.05.
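The two search spaces above can be enumerated as follows (a small sketch of the candidate grids only; the MAP-based selection itself proceeds as in the extrinsic case):

```python
# Candidate values for the free term-weight parameter epsilon:
# {10^-i, 2 * 10^-i : i in 1..5}. The reported optimum, 2e-4, is the
# m=2, i=4 entry of this grid.
eps_grid = sorted(10 ** -i * m for i in range(1, 6) for m in (1, 2))

# Candidate values for beta: 0.1 to 0.7 with a step of 0.05.
beta_grid = [round(0.1 + 0.05 * i, 2) for i in range(13)]
```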

D.3 KG embeddings
We follow the same approach as Dziri et al. (2021a) for generating the KG embeddings. OpenDialKG triples are also represented using a textual field called "render". For the triples containing this field, we pass the text through GPT2, extract hidden-state representations for each entity's word pieces, and obtain a final representation by applying a MaxPool over these hidden representations. For entity mentions not described in "render", we take their representations directly from the last hidden states of GPT2.

Table 11: Mixing ratios for different datasets
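The MaxPool step amounts to an element-wise maximum over the word-piece axis. A minimal sketch (the vectors here are placeholders; in the paper they are last-layer GPT2 hidden states for each word piece of the entity):

```python
def maxpool(vectors):
    """Element-wise max over a list of hidden-state vectors, one vector per
    word piece of the entity, producing a single entity representation."""
    return [max(col) for col in zip(*vectors)]

# Two word-piece states of dimension 3 -> one pooled entity embedding.
entity_vec = maxpool([[1.0, 5.0, -2.0],
                      [3.0, 2.0, 0.5]])
# entity_vec == [3.0, 5.0, 0.5]
```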

D.4 Mixing Ratios
Mixing ratios for creating the mixed datasets are given in Table 11. Perturbed and non-perturbed samples are drawn randomly from the component datasets.
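The mixing procedure can be sketched as below. This is a generic illustration, not the paper's script; the component names and ratios are placeholders for the actual values in Table 11.

```python
import random

def mix_datasets(components, ratios, total, seed=0):
    """Build a mixed dataset by randomly drawing from component datasets
    according to the given mixing ratios.

    components: dict mapping dataset name -> list of samples
    ratios:     dict mapping dataset name -> fraction of the mix (sums to 1)
    total:      desired size of the mixed dataset
    """
    rng = random.Random(seed)
    mixed = []
    for name, frac in ratios.items():
        n = round(frac * total)
        mixed.extend(rng.sample(components[name], n))  # draw without replacement
    rng.shuffle(mixed)
    return mixed
```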

E Implementation Details
The utterance- and token-level classifiers are implemented using the PyTorch Huggingface Transformers library (Wolf et al., 2020). The best-performing configurations for each model are shown in Tables 12, 13, 14, and 15. The models were trained on a single NVIDIA A5000 GPU; the average running time was 2.5 hours for the base models and ∼5 hours for the large models.
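Supervision for the token-level classifier can be derived from the perturbed entity spans. A minimal sketch of the label construction (assumed here to be binary per-token hallucination labels; the function and span format are illustrative, not the paper's exact code):

```python
def token_labels(tokens, hallucinated_spans):
    """Binary token-level hallucination labels: 1 if the token falls inside a
    hallucinated (perturbed) entity span, given as (start, end) token-index
    pairs with an exclusive end; 0 otherwise."""
    labels = [0] * len(tokens)
    for start, end in hallucinated_spans:
        for i in range(start, end):
            labels[i] = 1
    return labels

# The perturbed entity "Christopher Nolan" occupies token positions 3-4.
labels = token_labels(["I", "really", "enjoyed", "Christopher", "Nolan", "."],
                      [(3, 5)])
# labels == [0, 0, 0, 1, 1, 0]
```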

Figure 2: Illustration of our defined categories of fact hallucinations in KG-grounded dialogue systems

Figure 3: NER distribution in extrinsic-soft and hard hallucination

Figure 4: Top 10 relations in perturbed KG triples in intrinsic-soft hallucination

Figure 6: Positive and negative model predictions

Figure 7: Generalisation capability of the RoBERTa-large model fine-tuned using multiple splits of the intrinsic-history-corrupt dataset

Figure 11: Annotation interface for human feedback analysis (Instructions, part 1)

[Table header: Dataset | Type | Ext-Soft(%) | Ext-Hard(%) | Ext-Grp(%) | Int-Soft(%) | Int-Hard(%) | Int-Rep(%) | HC-Ext(%) | HC-Int(%) | N-Halluc(%)]

[SBJ], [OBJ] ∈ V are nodes denoting subject and object entities, and [PRE] ∈ R is a predicate, which can be understood as a relation type. Primarily, a neural dialogue system is guilty of generating hallucinated text when a valid path in the k-hop sub-graph G_c^k of the original KG G, anchored around a context entity c, does not support it.

Table 2: Extrinsic hallucination perturbed entity selection criteria

Table 3: Intrinsic hallucination perturbed entity selection criteria

Table 4: Extrinsic hallucination data statistics

Table 6: Test benchmark (numbers in percentages (%)) for component datasets; models trained on 25% of the total dataset.

Table 7: Test benchmark (numbers in percentages (%)) for mixed datasets; models trained on 25% of the total dataset.

Table 8: Performance of several benchmark models and models trained on FADE on the 500 human-verified data points (*p-value < 0.001).