Evaluating Object Hallucination in Large Vision-Language Models

Inspired by the superior language abilities of large language models (LLMs), large vision-language models (LVLMs) have recently been explored by integrating powerful LLMs to improve performance on complex multimodal tasks. Despite the promising progress of LVLMs, we find that they suffer from the hallucination problem, i.e., they tend to generate objects in their descriptions that are inconsistent with the target images. To investigate this issue, this work presents the first systematic study of object hallucination in LVLMs. We conduct evaluation experiments on several representative LVLMs and show that they mostly suffer from severe object hallucination. We further discuss how the visual instructions may influence hallucination and find that objects that frequently appear in the visual instructions or co-occur with the image objects are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experimental results demonstrate that POPE can evaluate object hallucination in a more stable and flexible way. Our code and data are publicly available at https://github.com/RUCAIBox/POPE.


Introduction
Large language models (LLMs) (Zhao et al., 2023) have shown remarkable abilities to solve various complex tasks by following human instructions in a zero-shot manner. The success of LLMs drives researchers to devise more powerful multimodal models based on the superior capacity of LLMs, so as to enhance the understanding of visual semantics (Alayrac et al., 2022; Li et al., 2023b). As a prominent example, GPT-4 (OpenAI, 2023) has exhibited exciting performance on multimodal tasks and scenarios.
Despite the success of LVLMs, previous work has revealed that their main components, i.e., LLMs and VLPMs, both suffer from hallucination. In particular, LLMs tend to hallucinate unintended text (Huang et al., 2021; Bang et al., 2023), and VLPMs might generate nonexistent objects in the image (Biten et al., 2022), termed object hallucination. It is generally believed that hallucination degrades model performance and greatly harms the user experience in real-world applications (MacLeod et al., 2017; Ji et al., 2022). Therefore, it is natural to ask: does hallucination still exist in LVLMs? In this paper, we systematically evaluate the issue of object hallucination in existing LVLMs, which refers to generating content that is inconsistent with the ground-truth objects in the given image.
To conduct our study, we first use the CHAIR (Caption Hallucination Assessment with Image Relevance) metric (Rohrbach et al., 2018) and examine the hallucination degree of several representative LVLMs on the MSCOCO dataset. Our preliminary experiments (Table 1) show that most LVLMs severely suffer from object hallucination, and are even more prone to hallucinate than small vision-language models. Besides, we find that the existing object hallucination evaluation method may not be best suited for LVLMs and further propose a Polling-based Object Probing Evaluation (POPE) method. The basic idea is to convert the evaluation of hallucination into a binary classification task by prompting LVLMs with simple Yes-or-No short questions about the probing objects (e.g., "Is there a car in the image?"). We show that such a method is more stable and flexible. Besides, by using different object sampling strategies, we validate that existing LVLMs are prone to hallucinate objects which frequently appear or co-occur in the visual instruction dataset.
Our main contributions are as follows: (1) We conduct an empirical study on object hallucination for several representative LVLMs and find that they are highly affected by object hallucination. (2) We discuss the potential reasons behind this problem, e.g., LVLMs tend to generate frequently appearing or co-occurring objects in the instruction corpora. (3) We propose an object hallucination evaluation approach called POPE, which is more stable and can be easily extended to unannotated datasets.

Large Vision-Language Model
Since LLMs have been shown to be general task solvers in a zero-shot/few-shot manner, a number of studies are devoted to improving VLPMs by integrating powerful LLMs for more accurate language understanding and generation (Zhu et al., 2023; Liu et al., 2023; Dai et al., 2023a). In this paper, we refer to the VLPMs enhanced with the integration of LLMs as Large Vision-Language Models (LVLMs).
Generally speaking, an LVLM consists of a vision encoder, a language encoder (i.e., an LLM), and a cross-modal alignment network. The training of LVLMs is generally composed of three major steps. First, a vision encoder and a language encoder are pre-trained on large-scale unimodal data (i.e., image and text data, respectively). Second, these two encoders are aligned through image-text alignment pre-training, which enables the LLM to generate a meaningful caption for a given image. Third, the aligned model is further fine-tuned on image-text instructions, so that it can generate satisfactory answers to a natural language question regarding a specific image. Note that in the second and third steps, we can optionally fine-tune different components instead of performing full-parameter fine-tuning.
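To make this three-component architecture concrete, below is a minimal PyTorch-style sketch of an LVLM forward pass. It is only an illustration under simplifying assumptions: the module sizes, the linear layers standing in for a pre-trained vision encoder and LLM, and the single-token visual prefix are all placeholders rather than the design of any model evaluated in this paper.

import torch
import torch.nn as nn

class ToyLVLM(nn.Module):
    """Illustrative LVLM: vision encoder + cross-modal alignment + LLM (all placeholders)."""

    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Step 1: unimodally pre-trained components (stand-ins for a ViT and an LLM).
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)
        self.llm_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        # Step 2: cross-modal alignment network mapping image features into the LLM space.
        self.align = nn.Linear(vision_dim, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, instruction_token_ids):
        img_feat = self.vision_encoder(image.flatten(1))       # (B, vision_dim)
        visual_tokens = self.align(img_feat).unsqueeze(1)      # (B, 1, llm_dim), a visual "prefix"
        text_tokens = self.llm_embed(instruction_token_ids)    # (B, T, llm_dim)
        # Step 3 (visual instruction tuning) trains on such interleaved image-text inputs.
        hidden = self.llm(torch.cat([visual_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)                            # next-token logits

model = ToyLVLM()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 17, 1000])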
Once the visual encoder and the LLM are well aligned, the derived LVLM can demonstrate a superior visual understanding ability. It can not only grasp the visual semantics of objects in the image, but also deeply understand the linguistic semantics for these objects by leveraging the parametric knowledge in the LLM. Further, the LVLM can perform complex reasoning over the related concepts about these objects, thus achieving improved performance on a variety of multimodal tasks, e.g., visual question answering (VQA).

Object Hallucination
Although LVLMs are powerful in solving vision-language tasks, they also suffer from the issue of object hallucination, as VLPMs do. In the computer vision literature (Rohrbach et al., 2018; Biten et al., 2022), object hallucination refers to a model generating descriptions or captions that contain objects which are inconsistent with, or even absent from, the target image. In general, object hallucination can be defined at different semantic levels. The most straightforward way is to define it at the object level, while more fine-grained definitions might concern the attributes or characteristics of objects. In this work, we focus on coarse-grained object hallucinations in model-generated captions and leave fine-grained object hallucinations, such as the number, attributes, and positions of objects, for future work. We present an example of object hallucination in Figure 1, where the hallucinated objects "meat bowl", "bottle", "beverage", and "condiment" are generated by the underlying LVLM.
The hallucination phenomenon hinders the safe use of LVLMs in real-world deployment, as it may result in unexpected consequences caused by these hallucinated objects (MacLeod et al., 2017). For example, due to an incorrect understanding of the external environment, an autonomous driving system could make wrong decisions when encountering unexpected events, which might lead to serious safety issues. In order to mitigate these issues, this work aims to study how object hallucination manifests in LVLMs from an evaluation perspective.

Object Hallucination in LVLMs
In this section, we evaluate the object hallucination problem in popular LVLMs using an existing method. We first introduce the evaluation settings and then analyze the experimental results.

Evaluation Settings
Caption Hallucination Assessment with Image Relevance (CHAIR) (Rohrbach et al., 2018) is a popular metric for evaluating object hallucination in image captioning tasks. Given the ground-truth objects in the image, CHAIR calculates the proportion of objects that appear in the caption but not in the image. Existing work commonly adopts its two variants, i.e., CHAIR_I and CHAIR_S, which evaluate the hallucination degree at the object instance level and the sentence level, respectively. They can be formulated as:

CHAIR_I = |{hallucinated objects}| / |{all objects mentioned}|,
CHAIR_S = |{sentences with hallucinated objects}| / |{all sentences}|.

We select five recently released LVLMs, i.e., mPLUG-Owl (Ye et al., 2023), LLaVA (Liu et al., 2023), Multimodal-GPT (Gong et al., 2023), MiniGPT-4 (Zhu et al., 2023) and InstructBLIP (Dai et al., 2023a), and prompt them with the following instructions to generate captions for images in MSCOCO (Lin et al., 2014):
• I_1: Generate a short caption of the image.
• I_2: Provide a brief description of the given image.
Then, we calculate CHAIR on these captions. We leave more details about the dataset and the evaluated models to Appendix A.
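For clarity, the following is a minimal sketch of how CHAIR_I and CHAIR_S can be computed once the objects mentioned in each caption have been extracted; the extraction step itself is assumed to follow the synonym lists and parsing rules of Rohrbach et al. (2018) and is not shown, and treating the mentioned objects as sets is a simplification.

def chair_scores(caption_objects, image_objects):
    """caption_objects: per-image sets of objects mentioned in the generated caption,
    normalized to MSCOCO categories; image_objects: per-image sets of annotated objects."""
    hallucinated_instances, mentioned_instances, hallucinated_captions = 0, 0, 0
    for mentioned, truth in zip(caption_objects, image_objects):
        hallucinated = mentioned - truth                # mentioned but absent from the image
        hallucinated_instances += len(hallucinated)
        mentioned_instances += len(mentioned)
        hallucinated_captions += int(bool(hallucinated))
    chair_i = hallucinated_instances / max(mentioned_instances, 1)   # instance level
    chair_s = hallucinated_captions / max(len(caption_objects), 1)   # sentence level
    return chair_i, chair_s

# Toy example: the second caption hallucinates a "bottle".
print(chair_scores([{"bowl", "broccoli"}, {"bowl", "bottle"}],
                   [{"bowl", "broccoli", "carrot"}, {"bowl", "apple"}]))
# (0.25, 0.5)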

Severity of Hallucinations.
As illustrated by the evaluation results in Table 1, most instruction-tuned LVLMs suffer from the object hallucination problem, even more severely than small models, e.g., LLaVA (32.7) vs. OSCAR_base (13.0) on CHAIR_S using Instruction 1. It indicates that object hallucination is an important problem for LVLMs and deserves attention. As a comparison, InstructBLIP hallucinates less than other LVLMs. A possible reason is that its visual instructions are collected from a wide variety of publicly available datasets and are relatively short. In contrast, other LVLMs mostly employ visual instructions generated by unimodal LLMs (Liu et al., 2023). Such synthetic visual instructions are generally longer and more informative, but may involve unexpected descriptive information (hallucination inherited from LLMs) that is inconsistent with the image, which could mislead LVLMs.

Disadvantages of CHAIR.
As Table 1 shows, the evaluation results can be affected by other factors, e.g., instruction designs and the length of captions. Specifically, although the two adopted instructions have similar semantic meanings, LVLMs prompted by Instruction 2 can even yield doubled values of the CHAIR metrics compared with those prompted by Instruction 1, and the performance order of some LVLMs also changes (e.g., the CHAIR_I values of LLaVA and MultiModal-GPT). It indicates the instability of the CHAIR metric when different instructions are employed. Besides, as CHAIR requires examining whether the objects mentioned in the generated caption are hallucinated, it needs complex human-crafted parsing rules to perform exact matching, which have not been adapted to the special generation styles of LVLMs and may lead to misclassification errors. Thus, it is necessary to consider a more suitable method that can stably and conveniently evaluate the object hallucination problem in LVLMs.

Influence of Instruction Data on Object Hallucination
Considering their impressive performance on complex vision-language tasks (Chen et al., 2023; Bai et al., 2023; Li et al., 2023a), it is counter-intuitive that the hallucination problem of LVLMs is so severe. Since smaller VLPMs suffer less from object hallucination, it is possible that the visual instruction-tuning process of LVLMs exacerbates object hallucination. In this section, we investigate the influence of the visual instruction data. We first make two basic hypotheses in Section 4.1 and then conduct qualitative and quantitative analysis to verify them in Section 4.2 and Section 4.3.

Qualitative Analysis
We first qualitatively analyze the correlation between the appearance frequency and hallucination.
For the first hypothesis, we plot a bar chart of the top ten most frequently appearing objects in MSCOCO and their hallucination times on the validation set of MSCOCO; for the second hypothesis, we select the top ten objects that most frequently co-occur with "dining table" and also plot a bar chart of their hallucination times across images that really contain a "dining table". We show the results of MiniGPT-4, LLaVA, MultiModal-GPT and mPLUG-Owl in Figure 2. Obviously, as the occurrence frequency of objects decreases (from right to left), there is a notable decrease in the hallucination times for all four LVLMs. It reveals that the frequently appearing and co-occurring objects in the visual instruction dataset are indeed more likely to be hallucinated by LVLMs. To better support our results, we also list the full statistics of all 80 COCO objects in Appendix B.

Quantitative Analysis
To further consolidate the above findings, we employ the top-k hit ratio (HR@k) to measure the consistency between the appearance frequency and hallucination times of objects, which is defined as:

HR_A@k = (1/n) Σ_i Hit@k(i) / Hallucinated(i),
HR_C@k(o) = (1/n) Σ_i Hit@k(i, o) / Hallucinated(i),

where HR_A and HR_C quantify the correlations between hallucination times and appearing and co-occurring frequency, respectively. Here, n is the total number of images, Hallucinated(i) denotes the number of hallucinated objects in the i-th example, Hit@k(i) denotes the number of top-k frequently appearing MSCOCO objects in Hallucinated(i), and Hit@k(i, o) denotes the number of top-k objects frequently co-occurring with the probing object o in Hallucinated(i). Therefore, HR@k reflects the proportion of top-k frequently appearing or co-occurring objects among all hallucinated objects. We present HR_A and HR_C(dining table) for the top 30 objects in Table 2 and leave HR_C(chair) and HR_C(car) to Appendix C. The HR_A@10 and HR_C@10(dining table) of all LVLMs are near 0.5 and 0.6, respectively. It indicates that, on average, approximately half of the hallucinated objects in each image belong to the top 10 frequently appearing COCO objects, while more than half are among the top 10 objects frequently co-occurring with the objects already present in the image. When we broaden our observation to the top 30 objects, this proportion continues to increase. These findings further verify that LVLMs mostly hallucinate common objects in the visual instruction data and inspire us to design three sampling strategies in our evaluation pipeline.
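A small sketch of HR@k under this definition is given below. The inputs (the per-image sets of hallucinated objects and the frequency-ranked object list) are assumed to be precomputed, and images without hallucinated objects are skipped to avoid dividing by zero, which is one possible convention.

def hit_ratio_at_k(hallucinated_per_image, ranked_objects, k):
    """hallucinated_per_image: list of sets of hallucinated objects, one per image;
    ranked_objects: objects sorted by (co-)occurrence frequency, most frequent first."""
    top_k = set(ranked_objects[:k])
    ratios = []
    for hallucinated in hallucinated_per_image:
        if hallucinated:                              # skip images with no hallucination
            hit = len(hallucinated & top_k)           # Hit@k(i)
            ratios.append(hit / len(hallucinated))    # Hit@k(i) / Hallucinated(i)
    return sum(ratios) / max(len(ratios), 1)

# HR_A@2 for two images, given a frequency-ranked object list.
print(hit_ratio_at_k([{"person", "bottle"}, {"car"}],
                     ["person", "chair", "car", "bottle"], k=2))  # 0.25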

POPE
In this section, we devise Polling-based Object Probing Evaluation (POPE), a simple yet effective approach for evaluating hallucination in LVLMs. We first provide an overview of POPE, and then evaluate the representative LVLMs with POPE. Finally, we discuss the stability and scalability of our method, and also analyze the impact of hallucination on the VQA task.

Overview of POPE
In the empirical results of Section 3, we have revealed the severity of the object hallucination problem in LVLMs and highlighted the limitations of the existing evaluation method, e.g., its sensitivity to instructions and bias toward short captions. Besides, existing methods mostly rely on parsing the generated captions to extract the predicted objects, which usually requires complex human-crafted rules and still inevitably omits or misclassifies objects.
Therefore, we consider devising a more suitable method for the stable, fair and flexible object hallucination evaluation of LVLMs, namely polling-based object probing evaluation (POPE). Specifically, POPE formulates the evaluation of object hallucination as a binary classification task that prompts LVLMs to output "Yes" or "No", e.g., "Is there a chair in the image?". In this way, by sampling objects that LVLMs are prone to hallucinate, we can construct a set of hard questions to poll LVLMs. As the standard answers to these questions are just "Yes" or "No", we can easily identify them without complex parsing rules, and avoid the influence of instruction designs and caption length, thus guaranteeing stability, fairness and flexibility.
Definition. Given an image caption dataset, POPE focuses on constructing a set of triples, each of which consists of an image, multiple questions and their answers ("Yes" or "No"). The formulated definition of a triple can be described as ⟨x, {q(o_i), a_i} (i = 1, ..., l)⟩, where x denotes the image, q(o_i) is the question probing o_i based on the template "Is there a/an <object> in the image?", o_i is the i-th object to be probed, a_i is the answer to the question ("Yes" or "No"), and l denotes the number of polling questions per image. o_i can be obtained either from annotations or from the results of automatic segmentation tools like SEEM (Zou et al., 2023). We set the ratio between ground-truth and nonexistent objects to 1:1 for label balance. After constructing the evaluation triples, we can directly poll LVLMs with them and collect the predicted answers.
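The sketch below illustrates how such triples can be assembled for a single image under the 1:1 label balance described above. The template string matches the one given in the definition; the function and variable names are illustrative, and the pool of negative objects is assumed to come from one of the sampling strategies described next.

import random

TEMPLATE = "Is there a/an {} in the image?"

def build_triples(image_id, gt_objects, negative_pool, seed=0):
    """Build (image, question, answer) triples with a 1:1 "Yes"/"No" ratio."""
    rng = random.Random(seed)
    candidates = [o for o in negative_pool if o not in gt_objects]
    negatives = rng.sample(candidates, k=len(gt_objects))     # as many "No" as "Yes" questions
    triples = [(image_id, TEMPLATE.format(o), "Yes") for o in gt_objects]
    triples += [(image_id, TEMPLATE.format(o), "No") for o in negatives]
    return triples

for triple in build_triples("image_0001", ["bowl", "broccoli", "apple"],
                            ["person", "car", "pear", "knife", "bottle"]):
    print(triple)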
Pipeline. The whole POPE pipeline is presented in Figure 3. After obtaining the objects in the image, we can start building polling questions. Questions whose answers are "Yes" can be directly built using ground-truth objects, while questions whose answers are "No" can be built by sampling from negative objects. Therefore, by devising different sampling strategies, we can validate whether LVLMs are prone to hallucinate specific objects, e.g., the frequently appearing or co-occurring objects discussed in Section 4. Thus, we devise the following three sampling strategies:
• Random Sampling: we randomly sample the objects that do not exist in the image.
• Popular Sampling: we select the top-k most frequent objects in the whole image dataset that do not exist in the current image, where k = l/2.
• Adversarial Sampling: we first rank all objects according to their co-occurring frequencies with the ground-truth objects, and then select the top-k frequent ones that do not exist in the image.
Under the above three settings, we can build evaluation questions of different difficulty levels. We evaluate the previously mentioned LVLMs on them with the following metrics.
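Before turning to the metrics, the following is a minimal sketch of the three negative sampling strategies. The frequency and co-occurrence statistics are assumed to be precomputed from the dataset annotations, k is set to half of the polling questions per image as described above, and all names are illustrative.

import random
from collections import Counter

def sample_negatives(gt_objects, all_objects, freq, cooccur, k, strategy, seed=0):
    """Pick k objects absent from the image according to the chosen strategy.

    freq: Counter of object frequencies in the dataset.
    cooccur: dict mapping each object to a Counter of objects it co-occurs with.
    """
    candidates = [o for o in all_objects if o not in gt_objects]
    if strategy == "random":
        return random.Random(seed).sample(candidates, k)
    if strategy == "popular":                       # most frequent absent objects
        return sorted(candidates, key=lambda o: -freq[o])[:k]
    if strategy == "adversarial":                   # objects co-occurring most with gt objects
        score = Counter()
        for g in gt_objects:
            score.update(cooccur.get(g, Counter()))
        return sorted(candidates, key=lambda o: -score[o])[:k]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy statistics: "fork" and "chair" co-occur often with "dining table".
freq = Counter({"person": 100, "chair": 80, "fork": 20, "pear": 5})
cooccur = {"dining table": Counter({"fork": 15, "chair": 10})}
print(sample_negatives({"dining table"}, list(freq), freq, cooccur,
                       k=2, strategy="adversarial"))  # ['fork', 'chair']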

Metrics. We adopt Accuracy, Precision, Recall and F1 score as the evaluation metrics. Accuracy reflects the proportion of correctly answered questions. Precision reflects the proportion of questions answered "Yes" by the model whose ground-truth answer is indeed "Yes", while Recall reflects the proportion of questions with the ground-truth answer "Yes" that are answered correctly. The F1 score combines Precision and Recall, and we select it as the major metric for evaluation. Besides, we also report the ratio of questions that LVLMs answer with "Yes" as a reference to analyze model behaviors.
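As a concrete reference, below is a minimal sketch of these metrics with "Yes" treated as the positive class; it assumes the model answers have already been mapped to binary "Yes"/"No" labels.

def pope_metrics(predictions, labels):
    """predictions, labels: equal-length lists of "Yes"/"No" strings."""
    tp = sum(p == "Yes" and y == "Yes" for p, y in zip(predictions, labels))
    fp = sum(p == "Yes" and y == "No" for p, y in zip(predictions, labels))
    fn = sum(p == "No" and y == "Yes" for p, y in zip(predictions, labels))
    tn = sum(p == "No" and y == "No" for p, y in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    yes_ratio = (tp + fp) / len(labels)              # proportion of "Yes" answers
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall,
            "F1": f1, "Yes": yes_ratio}

print(pope_metrics(["Yes", "Yes", "No", "Yes"], ["Yes", "No", "No", "Yes"]))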

Evaluation on MSCOCO
We evaluate all the LVLMs with POPE built on the validation set of MSCOCO (Lin et al., 2014).We randomly select 500 images with more than 3 ground-truth objects in the annotations and construct 6 questions for each image (i.e., l = 6).
The results are presented in Table 3, where we can draw a similar conclusion as in Table 1: InstructBLIP performs the best, while LLaVA, MultiModal-GPT and mPLUG-Owl suffer from a more severe hallucination problem, with F1 scores below 70. It indicates that POPE can well estimate the degree of the hallucination problem in LVLMs. Besides, we find that LLaVA, MultiModal-GPT and mPLUG-Owl are extremely prone to answer "Yes" (near 99%). It reveals that these three LVLMs are overconfident, leading to lower accuracy on questions with the answer "No". Furthermore, the performance of LVLMs consistently decreases from the random setting to the popular and adversarial settings. It is consistent with our findings in Section 4, as LVLMs are prone to hallucinate frequently appearing and co-occurring objects.

Advantages of POPE
As previously stated, the current approach for evaluating object hallucination in LVLMs, such as CHAIR, is instruction-based, which is hindered by LVLMs' sensitivity to prompts and requires object annotations and manually designed rules for evaluation. In contrast, POPE is more stable with respect to prompt forms and can be easily extended to unannotated datasets. Its probing results are also highly consistent with the model's captions.
Stability. Regardless of the variations in prompt templates, POPE requires LVLMs to answer simple closed-ended questions, which is less likely to introduce ambiguity compared with instruction-based methods. This characteristic contributes to its stability. To validate it, we evaluate LLaVA using both POPE and CHAIR_I with four different prompts for each. The evaluation results are presented in Table 4. It can be observed that the standard deviation of the F1 score is significantly lower than that of CHAIR_I, which confirms that POPE exhibits higher stability when faced with different prompts.
Scalability. As mentioned before, with the assistance of automatic segmentation tools, POPE can be easily extended to datasets without annotations. To validate this, we adopt SEEM (Zou et al., 2023) to annotate images from three datasets (i.e., MSCOCO, A-OKVQA (Schwenk et al., 2022) and GQA (Hudson and Manning, 2019)) and build POPE based on the segmentation results. We evaluate InstructBLIP, MiniGPT-4 and LLaVA on them and report the results in Table 5 and Table 11 (presented in Appendix D). In Table 5, the performance of all LVLMs mostly follows the same trend as the annotation-based POPE in Table 3, i.e., Random > Popular > Adversarial, and InstructBLIP again performs the best.

Impact of Hallucination on VQA. For VQA tasks, we evaluate the SEEM-based POPE and the VQA scores of LVLMs on the A-OKVQA and GQA datasets. Since LVLMs are prone to generate answers in an open-ended manner, we utilize ChatGPT to help parse the generated results to better evaluate the VQA performance; the details of the evaluation settings are presented in Appendix E. For image captioning tasks, we evaluate the captions of 500 images in POPE with traditional metrics, and the evaluation results are left in Appendix F. The VQA evaluation results are shown in Table 6. InstructBLIP performs the best under all settings, highlighting the importance of instruction tuning on large visual instruction corpora. Note that since InstructBLIP has been trained on A-OKVQA, this result should be considered with caution. Furthermore, although MiniGPT-4 achieves a higher F1 score than LLaVA, its performance on VQA tasks is relatively poor. A possible reason is that the instruction dataset of MiniGPT-4 only derives from image caption data, while LLaVA uses 158K visual instruction examples involving complex visual questions. The results imply that the degree of hallucination may not always be consistent with the VQA performance; these two evaluation aspects are both important and should be considered in real-world applications.

Conclusion
In this work, we conducted evaluation experiments on several LVLMs and examined how they suffer from the object hallucination issue. By investigating the reasons for object hallucination, we empirically revealed that the object distributions of the visual instructions would affect the object hallucination of LVLMs. Besides, we also found that the existing hallucination evaluation methods might be affected by the input instructions and the generated text of LVLMs, thus leading to less reliable evaluation results. To address this issue, we proposed a polling-based query method called POPE to provide an improved evaluation approach for the object hallucination of LVLMs. Experimental results have shown that our proposed POPE can better evaluate the object hallucination issue of LVLMs.

Limitations
Despite our extensive explorations, this work still has several limitations. First, we only focus on the object hallucination problem in LVLMs and do not consider other aspects that reflect the capacities of LVLMs. This means that the current evaluation task cannot measure the overall performance of LVLMs: if a model obtains a higher score in our evaluation setting, it does not necessarily indicate a stronger overall capacity than one with a lower score. Second, due to limited computational resources, we have to evaluate all models on a part of the validation set of each dataset. The reported results might be affected by the corresponding data distribution, though we have carefully set up the experiments. Third, our proposed POPE utilizes a matching-based method to determine whether LVLMs answer "Yes" or "No"; empirically, LVLMs may occasionally fail to provide answers that explicitly contain these words, which may lead to inaccurate evaluation results. Fourth, when combined with the automatic segmentation tool, the objects are annotated based on the label set of the tool, which may be inconsistent with the collected human annotations, leading to a divergence in evaluation results. Finally, this work has only compared a small number of LVLMs, without including some recently released or closed-source ones. We leave the evaluation of more LVLMs as future work.
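To illustrate the third limitation, below is a sketch of the kind of matching-based answer extraction described above; the keyword rule is an illustrative assumption rather than the exact rule used in our implementation, and answers containing neither keyword are exactly the failure case discussed.

import re

def match_answer(text):
    """Map a free-form answer to "Yes"/"No"; return None if neither word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    if "yes" in words:
        return "Yes"
    if "no" in words:
        return "No"
    return None  # no explicit "Yes"/"No": such answers make the evaluation inaccurate

print(match_answer("Yes, there is a knife in the image."))   # Yes
print(match_answer("No, there is no pear in the picture."))  # No
print(match_answer("I cannot tell from the image."))         # None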
Although we have extensively discussed the hallucination issues of LVLMs, this does not indicate that we hold a negative opinion of their progress. Instead, developing LVLMs by leveraging powerful LLMs remains a very promising direction, and the models evaluated in this work are excellent demonstrations of it. Still, we hope that our work can bring new ideas or insights for developing more reliable and human-aligned LVLMs.

A Details of Evaluation Settings
Dataset. MSCOCO (Lin et al., 2014) is a large-scale image recognition, segmentation, and captioning dataset. Here, we randomly sample 2,000 images with annotations about contained objects and human-labeled captions from its validation set as our evaluation dataset. For computing the CHAIR metric on MSCOCO, we follow the settings in Rohrbach et al. (2018), which only consider the 80 objects appearing in the MSCOCO segmentation challenge.
Models. The evaluated LVLMs basically consist of three parts: a visual encoder, an alignment model, and a large language model. All the above models have been tuned on collected visual instruction data. A detailed comparison (e.g., backbones and trainable components) of these LVLMs is shown in Table 7. We also collect the evaluation results of smaller VLPMs, i.e., OSCAR (Li et al., 2020), VinVL (Zhang et al., 2021), BLIP (Li et al., 2022b) and OFA (Wang et al., 2022), from Dai et al. (2023b) as baseline results.

B Additional Qualitative Analysis Results
To better validate our hypotheses, we expand the analysis scope to all 80 objects in MSCOCO and present the results in this part.
For hypothesis (1), we present the cumulative proportions of the hallucination times of all 80 COCO objects in Table 8. The table demonstrates that, across all models, the top 30 objects comprise approximately 70% of all hallucinated objects. For hypothesis (2), we present the cumulative proportions of the hallucination times of all COCO objects that co-occur with "dining table" in Table 9. We also arrange these objects by their co-occurrence frequency. Similarly, the top 20 objects comprise about 80% of all hallucinated objects.

C Additional Quantitative Analysis Results
We present the HR_C results for two other common objects, i.e., "chair" and "car", in Table 10, which show a similar trend to Table 2.

D Results of SEEM-based POPE on A-OKVQA and GQA
We adopt SEEM (Zou et al., 2023) to annotate images from A-OKVQA and GQA and build POPE based on the segmentation results; the evaluation results are reported in Table 11.

Figure 1: Cases of object hallucination in LVLMs. Bold objects are ground-truth objects in the annotations and red objects are hallucinated by LVLMs. The left case is from the traditional instruction-based evaluation method, and the right cases are from three variants of POPE.
Figure 2: (a) Hallucination times of the top ten frequently appearing objects, whose frequencies decrease from right to left. (b) Hallucination times of the top ten objects co-occurring with "dining table", whose frequencies decrease from right to left.

Figure 3: Overview of the POPE pipeline. Given an input image, POPE first extracts ground-truth objects in the image either from human annotations or with the help of automatic segmentation tools like SEEM. Then, POPE conducts negative sampling for nonexistent objects in the image under the Random/Popular/Adversarial settings. Finally, the ground-truth objects and nonexistent objects are formulated into question templates to poll LVLMs.

Table 1: Results of CHAIR on VLPMs and LVLMs. I_1 denotes "Generate a short caption of the image" and I_2 denotes "Provide a brief description of the given image". The average length of generated captions is also reported. The results of VLPMs (OSCAR, VinVL, BLIP, and OFA) are collected from Dai et al. (2023b). The best results in each block are denoted in bold.

Table 2: Results on MSCOCO that quantify the correlations between the appearing/co-occurring frequency of objects and the hallucination times of LVLMs.

Table 3: Results of LVLMs under the three evaluation settings of POPE on the validation set of MSCOCO. "Yes" denotes the proportion of answering "Yes" to the given question. The best results in each block are denoted in bold.

Table 4: Evaluation results of LLaVA on POPE and CHAIR with different prompt templates.

Table 5: Results of SEEM-based POPE on the validation set of MSCOCO. The results of POPE using ground-truth annotations are copied from Table 3. The best results in each block are denoted in bold.

Table 6: Evaluation results of LVLMs on POPE and VQA. For VQA tasks, we report the VQA score on A-OKVQA and Accuracy on GQA. For POPE, we copy the results under the random setting from Table 11.