Hallucination Detection for Grounded Instruction Generation

We investigate the problem of generating instructions to guide humans to navigate in simulated residential environments. A major issue with current models is hallucination: they generate references to actions or objects that are inconsistent with what a human follower would perform or encounter along the described path. We develop a model that detects these hallucinated references by adopting a model pre-trained on a large corpus of image-text pairs, and fine-tuning it with a contrastive loss that separates correct instructions from instructions containing synthesized hallucinations. Our final model outperforms several baselines, including using word probability estimated by the instruction-generation model, and supervised models based on LSTM and Transformer.


Introduction
Performance of neural-network-based models on generating navigation instructions is substantially inferior to that of humans (Zhao et al., 2023). These models often hallucinate, generating references to objects or actions that do not exist or are impossible to execute in the environment. Similar behavior has been observed in language models in other domains of text generation (Raunak et al., 2021; Ji et al., 2023; Xiao and Wang, 2021; Lee et al., 2018; Guerreiro et al., 2022; Rawte et al., 2023).
Instructions containing hallucinations can confuse or misdirect humans, leading to frustration and sometimes even catastrophic mistakes. Detecting hallucinations is therefore essential for improving instruction-generation models and for informing human users of risk. Nevertheless, ground-truth word-level hallucination labels are typically not readily available in this domain. Meanwhile, hiring crowdworkers to annotate instructions can be very costly (Anderson et al., 2018b; He et al., 2021; Wang et al., 2022; Gao et al., 2022).
We propose a data-efficient weakly supervised approach to hallucination detection. Our approach reduces the necessary supervision in two ways. First, we leverage a pre-trained vision-language model (Guhur et al., 2021) that has learned transferable representations of path-instruction pairs through self-supervised learning. Second, we introduce data-augmentation strategies to create synthetic data with "free" hallucination labels.
We fine-tune the pre-trained model with the synthesized data using a contrastive learning objective to learn representations that separate positive examples (non-hallucinations) from negative examples (hallucinations). Our model outperforms various baselines in terms of F-1 scores on human-annotated evaluation data, beating LSTM-based and Transformer-based models by 6.2 and 10.0 points, respectively. Ablation studies demonstrate the effectiveness of the proposed self-supervised pre-training and contrastive fine-tuning approach. We release the code, models, and data at https://lingjunzhao.github.io/hallucination_detection.html.
Grounded Instruction Generation. Instruction generation has been commonly studied in navigation settings (Anderson et al., 1991; Byron et al., 2010; Koller et al., 2010; Striegnitz et al., 2011; Goeddel and Olson, 2012; Fried et al., 2017, 2018; Wang et al., 2022; Kamath et al., 2022). Recent work by Zhao et al. (2023) reveals a significant gap between the performance of models and humans. Our work constructs a model that can be useful for evaluating and enhancing instruction-generation models. Huang et al. (2019) and Zhao et al. (2021) train LSTM-based discriminative models with contrastive learning to score instructions. We follow a similar approach but focus on identifying word-level hallucinations, and effectively leverage a large pre-trained Transformer model.

Problem Setting
Grounded instruction generation. Our task takes place in an environment where a speaker model S(u | r) composes an instruction u to communicate an imaginary trajectory r to a follower so that the latter can generate the same trajectory in the environment. An instruction is a sequence of words u_i, whereas a trajectory is a sequence of observations o_t and actions a_t. We employ the Matterport3D simulator for experiments (Anderson et al., 2018b), which embeds a follower in a 3D model of a real-world residential building. The observation o_t of the follower comprises an RGB image representing the panoramic view at a location in a building, and orientation features encoding the follower's gaze direction. Each action a_t moves the follower to a new location close to where it is standing and changes its observation.
Speaker model. We follow Zhao et al. (2023) to train a T5-based (Raffel et al., 2020) speaker model. This model encodes a trajectory into a sequence of hidden vectors and applies multi-headed attention on those vectors to generate an instruction autoregressively. It is trained on the Room-to-Room (R2R) dataset provided by the Matterport3D simulator. Details about the model are provided in §A.1.
Hallucination in grounded instruction. Instructions generated by our speaker model often contain words that are inconsistent with the input trajectory. We refer to those words as hallucinations. Similar to prior work (Zhou et al., 2020), we observe two types of hallucinations:
• Intrinsic hallucination is a word that needs to be replaced because it inaccurately describes an observation or action. For example, an instruction says "Walk past the reception desk and out the door on the right," but in the described trajectory, the door is on the left;
• Extrinsic hallucination is a word that needs to be removed because it has no correspondence in the input trajectory. Our model typically exhibits this type of hallucination by repeatedly generating the same sentence, e.g., "Walk out of the office. Walk into the hallway and turn left. Walk into the hallway and turn left."
We formulate hallucination detection as binary classification: given an input x = (r, u, i) consisting of a trajectory r, an instruction u, and an index i ∈ {1, ..., |u|}, decide whether the word u_i is a hallucination, i.e., whether it should be replaced or removed to make u consistent with r.
Candidate selection. For each instruction, we identify a set of candidate words for classification, which are (a) directional words like left, right, etc. (see §A.2 for a full list) as well as (b) nouns identified by the spaCy part-of-speech tagger (Honnibal and Montani, 2017).
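The candidate-selection rule can be sketched as follows. This is a minimal illustration that assumes the instruction has already been tagged with part-of-speech labels (for example by spaCy); the function name `select_candidates`, the 0-based indexing, and the short `DIRECTION_WORDS` set are hypothetical stand-ins, since the full list of directional words is only given in §A.2.

```python
from typing import List, Tuple

# Assumed, illustrative subset of directional words; the full list is in §A.2.
DIRECTION_WORDS = {"left", "right", "up", "down"}


def select_candidates(tagged: List[Tuple[str, str]]) -> List[int]:
    """Return 0-based indices of candidate words.

    `tagged` holds (word, coarse POS tag) pairs. A word is a candidate if it
    is a directional word or if the tagger labels it as a noun.
    """
    candidates = []
    for i, (word, pos) in enumerate(tagged):
        if word.lower() in DIRECTION_WORDS or pos == "NOUN":
            candidates.append(i)
    return candidates


# Hand-tagged example so the sketch runs without a tagger installed.
example = [
    ("Walk", "VERB"), ("past", "ADP"), ("the", "DET"), ("desk", "NOUN"),
    ("and", "CCONJ"), ("turn", "VERB"), ("left", "ADV"),
]
print(select_candidates(example))
```

Each returned index then yields one classification input (r, u, i).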

Architecture
We learn a classifier C(y = 1 | x = (r, u, i)) to decide whether a word u_i is hallucinated. Our model is based on the Airbert model (Guhur et al., 2021), which inherits the ViLBERT architecture (Lu et al., 2019). An overview of the model is given in Figure 1. It implements two Transformers: one encodes the instruction u and the other encodes the trajectory r. We wrap the word to be classified u_i between a pair of special tokens ([BH] and [EH]). Let h_lang be the output of the language-encoding Transformer, and h_vision be that of the vision-encoding Transformer. The model computes a score function s(x) = s(r, u, i) = w^⊤ (h_lang ⊙ h_vision), where w is a learnable vector, and ⊙ denotes element-wise multiplication. More details about the model are given in §A.1.
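To make the input marking and the score function concrete, here is a small pure-Python sketch. It treats the two encoder outputs as plain lists of floats and does not reproduce the Airbert encoders themselves; the helper names `mark_target` and `score` and the 0-based index are assumptions made for illustration.

```python
from typing import List, Sequence


def mark_target(words: List[str], i: int) -> List[str]:
    """Wrap the word at 0-based index i between the [BH] and [EH] tokens."""
    return words[:i] + ["[BH]", words[i], "[EH]"] + words[i + 1:]


def score(h_lang: Sequence[float], h_vision: Sequence[float],
          w: Sequence[float]) -> float:
    """Compute s(x) = w^T (h_lang * h_vision) with an element-wise product."""
    if not (len(h_lang) == len(h_vision) == len(w)):
        raise ValueError("h_lang, h_vision and w must have the same length")
    return sum(wk * a * b for wk, a, b in zip(w, h_lang, h_vision))


print(mark_target(["turn", "left", "now"], 1))
print(score([1.0, 2.0], [3.0, 4.0], [0.5, -1.0]))
```

A higher score indicates that the marked instruction is more compatible with the trajectory.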

Learning approach
Self-supervised pre-training. Instead of learning from scratch, we fine-tune a pre-trained checkpoint of the Airbert model. The checkpoint was first trained on a large collection of 1.4M images and 0.7M captions collected from Airbnb. It was subsequently adapted for a trajectory-instruction compatibility estimation task using the Room-to-Room dataset. The objective in each phase combines BERT-style pre-training (mask and pair prediction) with contrastive learning. We refer readers to the original paper for a detailed description of the pre-training phase.
Contrastive fine-tuning. We assume a dataset of contrastive pairs (x⁺, x⁻). The positive and negative examples of a pair have the same trajectory r and word index i, but differ in the instruction u. The classified word in x⁻ is a hallucination, whereas that in x⁺ is not. For each pair, we compute the model scores s(x⁺) and s(x⁻), and construct the softmax distribution p = Softmax(s) where s = (s(x⁺), s(x⁻)). We then train the model to recognize the positive example by minimizing the cross entropy between p and p⋆ = (1, 0). This objective effectively forces the representation of the trajectory to be similar to that of the positive instruction and dissimilar to that of the negative instruction. At inference time, we define the hallucination detection classifier as C(x) = 1 − σ(s(x)), where σ is the sigmoid function.
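The pairwise objective and the inference-time classifier can be written out in a few lines. The sketch below works on scalar scores only and is meant to illustrate the formulas, not the actual training code; the function names are hypothetical. With target (1, 0), the cross entropy reduces to −log p₁ = log(exp(s(x⁺)) + exp(s(x⁻))) − s(x⁺).

```python
import math


def contrastive_loss(s_pos: float, s_neg: float) -> float:
    """Cross entropy between Softmax((s_pos, s_neg)) and the target (1, 0)."""
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


def hallucination_probability(s: float) -> float:
    """C(x) = 1 - sigmoid(s(x)), computed as sigmoid(-s) without overflow."""
    if s >= 0:
        e = math.exp(-s)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(s))


print(contrastive_loss(2.0, -1.0))     # small loss: positive scored higher
print(contrastive_loss(-1.0, 2.0))     # large loss: negative scored higher
print(hallucination_probability(0.0))  # 0.5
```

The loss depends only on the gap s(x⁺) − s(x⁻), so training pushes the non-hallucinated instruction to score higher than its corrupted counterpart for the same trajectory.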

Synthetic data creation
Even for fine-tuning, acquiring human-labeled data can be prohibitively expensive. For evaluation, we manually annotated a small sample of labels (§5). The annotation process was laborious, with an average time of 30 minutes required to annotate just 10 instructions. Based on our calculations, with a compensation of 15 USD per hour, it would cost approximately 9,000 USD to hire crowd workers to annotate all instances (∼12,000) in the R2R training set. Thus, we propose a more cost-effective methodology for generating training data.
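The cost estimate follows directly from the figures quoted above, as this short worked calculation shows. The function and parameter names are illustrative; the numbers are the ones stated in the text.

```python
def annotation_cost_usd(num_instructions: int,
                        minutes_per_batch: float = 30.0,
                        instructions_per_batch: int = 10,
                        hourly_rate_usd: float = 15.0) -> float:
    """Estimate the cost of annotating `num_instructions` instructions."""
    minutes = num_instructions / instructions_per_batch * minutes_per_batch
    hours = minutes / 60.0
    return hours * hourly_rate_usd


# 12,000 instructions at 3 minutes each is 600 hours of annotation.
print(annotation_cost_usd(12_000))
```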
Synthetic negative examples. We start with a training example (u⁺, r) in the Room-to-Room training set and modify the human-written instruction u⁺ to create instructions with hallucinations. We first extract the candidate words in the instruction (§3). To create an intrinsic hallucination, we choose a candidate word and apply the following procedure (in the examples, [original → replacement] marks the modified word):
• If the word is a direction, we replace it with an alternative direction. E.g., "Walk [down → up] one flight of stairs and stop on the landing.";
• If it is a room, we substitute it with another room randomly selected from a pre-composed list. E.g., "Exit the [bedroom → balcony] via the farthest left. Walk toward the couch. Stop there.";
• Otherwise, we swap it for another word in the instruction that is neither a direction nor a room. E.g., "Exit the bedroom using the [door → step] on the left then go straight until you get to the stairs and wait on the second [step → door]."
Using this procedure, we first generate an intrinsic hallucination in u⁺ to synthesize u⁻. Then, with a probability of 0.5, we synthesize another intrinsic hallucination in each of u⁺ and u⁻. This step makes the training instructions more similar to the test-time inputs, which may contain multiple intrinsic hallucinations as they are generated by imperfect speaker models.
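The three-way rule for intrinsic hallucinations can be sketched as below. This is an illustrative reading of the procedure, not the released implementation: the `DIRECTIONS` and `ROOMS` lists are short hypothetical stand-ins for the actual lists, indices are 0-based, and restricting the swap partner to other candidate words is an assumption.

```python
import random
from typing import List

# Hypothetical, abbreviated word lists used only for illustration.
DIRECTIONS = ["left", "right", "up", "down"]
ROOMS = ["bedroom", "balcony", "kitchen", "bathroom", "hallway"]


def make_intrinsic_hallucination(words: List[str], i: int,
                                 candidates: List[int],
                                 rng: random.Random) -> List[str]:
    """Return a copy of `words` in which the candidate at index i is corrupted."""
    out = list(words)
    word = out[i]
    if word in DIRECTIONS:
        # Replace a direction with a different direction.
        out[i] = rng.choice([d for d in DIRECTIONS if d != word])
    elif word in ROOMS:
        # Replace a room with a different room from the list.
        out[i] = rng.choice([r for r in ROOMS if r != word])
    else:
        # Swap with another candidate that is neither a direction nor a room.
        partners = [j for j in candidates
                    if j != i and words[j] != word
                    and words[j] not in DIRECTIONS and words[j] not in ROOMS]
        if partners:
            j = rng.choice(partners)
            out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)
u_pos = "Exit using the door then wait on the second step".split()
print(" ".join(make_intrinsic_hallucination(u_pos, 3, [3, 9], rng)))
```

The corrupted copy plays the role of u⁻, paired with the unmodified u⁺ and the same trajectory and word index.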
To create an instruction with extrinsic hallucinations, we append a sentence, taken from u⁺ or another instruction, to the end of a random sentence in u⁺. For example: "Walk out of the office. Walk into the hallway and turn left. Walk into the hallway and turn left." Every word in the added sentence is considered an extrinsic hallucination. We do not create additional intrinsic hallucinations in the instruction.
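A possible sketch of the extrinsic case is given below, operating on an instruction that has already been split into sentences. The function name, the sentence-level representation, and the way the donor sentence is sampled are assumptions for illustration.

```python
import random
from typing import List, Tuple


def make_extrinsic_hallucination(sentences: List[str],
                                 donor_sentences: List[str],
                                 rng: random.Random) -> Tuple[List[str], int]:
    """Insert one donor sentence right after a randomly chosen sentence.

    Returns the new sentence list and the index of the inserted sentence,
    whose words would all be labeled as extrinsic hallucinations.
    """
    position = rng.randrange(len(sentences))
    added = rng.choice(donor_sentences)
    new_sentences = sentences[:position + 1] + [added] + sentences[position + 1:]
    return new_sentences, position + 1


u_pos = ["Walk out of the office.", "Walk into the hallway and turn left."]
u_neg, added_index = make_extrinsic_hallucination(u_pos, u_pos, random.Random(0))
print(" ".join(u_neg), "| inserted sentence index:", added_index)
```

Passing the sentences of u⁺ as donors reproduces the repeated-sentence pattern; passing sentences from another instruction gives an unrelated insertion.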
Alleviating input-distribution shift. A model trained only on human-written instructions may perform poorly on model-generated instructions. Therefore, we also include "high-quality" model-generated instructions on the R2R training set as positive examples and apply the same strategies to generate negative examples. The quality of an instruction is measured by the success rate of an ensemble of VLN⟳BERT instruction-following agents (Hong et al., 2021) in recreating the described trajectory. We consider a model-generated instruction to be of high quality if at least 80% of the ensemble agents can successfully reach the final location in the described trajectory.
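The quality filter amounts to a simple threshold on the ensemble success rate. In the sketch below, each agent's outcome is represented as a boolean; the function name and the ensemble size of five in the example are assumptions.

```python
from typing import Sequence


def is_high_quality(agent_successes: Sequence[bool],
                    threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the agents reached the final location."""
    if not agent_successes:
        raise ValueError("need at least one agent outcome")
    return sum(agent_successes) / len(agent_successes) >= threshold


print(is_high_quality([True, True, True, True, False]))   # 4 of 5 agents succeed
print(is_high_quality([True, True, True, False, False]))  # 3 of 5 agents succeed
```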

Experiments
Data. Following the procedure described in §4.3, we generate a training set of 325,346 contrastive pairs. For evaluation, we use the same 75 evaluation trajectories as Zhao et al. (2023) to form the test set. We randomly select another set of 20 trajectories in the R2R validation seen set for development. The environments in which the evaluation trajectories are generated are a subset of the training environments. We use the speaker model to generate instructions from these trajectories. The first two authors then manually annotate word-level hallucinations, creating 209 development examples and 632 test examples. The final labels are decided by mutual agreement. We choose the decision threshold of a model to maximize its F-1 score on the development set.
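Threshold selection on the development set could look like the following sketch. The search strategy is an assumption (the text only states that the threshold maximizes development F-1); here every observed probability is tried as a cutoff, and a word is predicted as a hallucination when its probability is at least the cutoff.

```python
from typing import List, Sequence, Tuple


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F-1 of the hallucination class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(probs: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return the (threshold, F-1) pair that maximizes F-1 on the given data."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


dev_probs = [0.9, 0.8, 0.3, 0.1]   # toy hallucination probabilities C(x)
dev_labels = [1, 1, 0, 0]          # toy gold labels
print(select_threshold(dev_probs, dev_labels))
```

The selected threshold is then held fixed when the model is evaluated on the test set.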
Baselines. (i) The random classifier assigns a label chosen uniformly at random; (ii) the speaker-model probability baseline defines the hallucination probability as C(x) = 1 − S(u_i | r; u_<i), where x = (r, u, i), S is the speaker model (§3), and u_<i is the instruction generated up to step i − 1 for the input r; (iii) LSTM and (iv) T5 are binary classifiers learned under a standard maximum-likelihood objective. They implement an encoder-decoder architecture based on LSTM and Transformer, respectively, and are trained using the same synthetic dataset as our proposed model. These models are initialized with random parameters. The detailed implementations and hyperparameters of all models are given in §A.1.
Main results (Table 1). The speaker-model probability is a remarkably strong baseline, despite not being trained for hallucination detection. Its performance is on par with that of T5, which is the same model but trained specifically for hallucination detection. The LSTM-based model outperforms the T5-based models. Scaling up the size of the T5 model improves the recall score by 10 points. Our proposed model (fine-tuned Airbert) beats all baselines by wide margins in terms of F-1 score for hallucination labels (+10.0 versus T5-base, +6.2 versus LSTM). It excels in precision compared to the baselines. We also include results on the development set in §A.3.
Ablation studies (Figure 2). Our results confirm that self-supervised pre-training and contrastive fine-tuning are essential to the performance of our model. Without pre-training, our model performs as poorly as the LSTM-based model. We also compare fine-tuning via contrastive learning with fine-tuning via maximum-likelihood learning. In the latter approach, the model simply takes as input an example (r, u, i) and learns to directly predict the true label. The approach underperforms contrastive learning by 4.9 F-1 points. Our finding aligns with previous work (Gunel et al., 2021; Zhang et al., 2021; Goyal et al., 2023), suggesting that contrastive learning is effective not only as a representation learning objective, but also as a classification objective.
Error and Qualitative Analysis. In Table 2, we break down the performance of our model by word type. Our model struggles with detecting room and object hallucinations, indicating that its understanding of visually grounded words is lacking. In particular, it has relatively low recall on object hallucinations, potentially due to the lack of diversity of this word type in the training data. Figure 3 shows a few success and failure cases of our model.

Conclusion
This work is an early attempt to address the hallucination issue in grounded instruction generation. We have shown that techniques like self-supervised pre-training on multimodal data and contrastive fine-tuning on synthetic data are promising scalable approaches. We hope that these directions can be further developed in future work.

Limitations
Despite the effectiveness of the data generation method, this approach requires substantial domain-specific knowledge. Our method, particularly for generating directional hallucinations, is based on heuristics and does not take into account the actual environment. Another limitation is the small size of the evaluation datasets due to the high cost of annotation.

Figure 1: Our hallucination detection model, which takes as input an instruction with a target word and determines whether it should be replaced or removed to be consistent with a visual trajectory. To build this model, we fine-tune pre-trained Airbert (Guhur et al., 2021) with a contrastive learning objective.

Figure 2: The effectiveness of self-supervised pre-training and contrastive fine-tuning. Results are F-1 scores of hallucination labels on the test set.
Model highlight: Walk up the steps and turn right. Walk up the steps and turn right …
Gold highlight: Walk up the steps and turn right. Walk up the steps and turn right …
(a) Success on detecting extrinsic hallucination: the second sentence should be removed entirely; the model marks all the candidate words in the sentence.
Model highlight: … Walk past the bed and exit the bedroom …
Gold highlight: … Walk past the bed and exit the bedroom …
(b) Success on detecting intrinsic hallucination: the correct direction is to go to the left side of the bedroom, not exiting it.
Model highlight: Walk past the couch and stop in front of the TV
Gold highlight: Walk past the couch and stop in front of the TV
(c) Model misidentifies the stopping location due to lacking depth information: the TV in the far left corner looks to be close to the true stopping location.
Model highlight: Walk down the hallway and stop in the first doorway on your left
Gold highlight: Walk down the hallway and stop in the first doorway on your left
(d) Ambiguous direction: a slight left turn that appears like a straight walk in this viewpoint.

Figure 3: Some successful and failure cases of the fine-tuned Airbert model. The blue arrow indicates the described path, and the green represents the next location.

Table 1: Performance on the test set of our proposed hallucination detection model and various baselines. The decision threshold of each model is selected to maximize F-1 score of hallucination labels on the development set.

Table 2: Fine-tuned Airbert performance broken down by word type. Results are on the test set.