ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories

Recently, Large Language Models (LLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current LLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a human-annotated dataset specifically designed for zero- and few-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we benchmark the fundamental visual commonsense knowledge of both unimodal LLMs and VaLMs. Furthermore, we analyze the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.


Introduction
With the breakthrough progress of Large Language Models (LLMs) in recent years (Brown et al., 2020; Zhang et al., 2022b), LLMs are gradually being adopted as general-purpose API interfaces (e.g., ChatGPT). In addition to language, these intelligent agents are further required to understand visual knowledge (Hao et al., 2022), especially visual perception, which is crucial for real-world interactions such as commonsense reasoning (Talmor et al., 2019), recipe generation (Agarwal et al., 2020), and robotic navigation (Shah et al., 2022).
However, current studies lack a systematic evaluation of how well these widely used LLMs and their variants are capable of visual understanding. Recent research proposes to evaluate the visual capability of models through visual commonsense evaluation (Bagherinezhad et al., 2016; Norlund et al., 2021). Visual commonsense refers to the general visual knowledge that is commonly shared across the world, as opposed to the visual information that is specific to a single image; it can be captured through a series of related images.

[Figure 1: Examples of visual commonsense questions, e.g., "What color is the fur of a Samoyed?" (Let me imagine... oh! Samoyeds are white.) and "Is the dog in the picture smiling or not?" (Yes, it is smiling.)]

As shown in Figure 1, visual commonsense evaluation aims to evaluate the model's understanding of commonly shared human knowledge about generic visual concepts, including color (Bruni et al., 2012; Norlund et al., 2021; Zhang et al., 2022a), spatial relations (Liu et al., 2022), relative sizes (Bagherinezhad et al., 2016), etc. Despite their insightful investigations, these studies still have the following limitations on two sides: 1) data side: some research mines visual commonsense attributes based on frequency distributions in plain-text corpora, which diverges from human visual perception and exhibits additional textual bias (Zhang et al., 2022a); 2) model side: most existing evaluations focus only on a specific model group, lacking a comprehensive exploration of various model families (Bagherinezhad et al., 2016; Norlund et al., 2021; Liu et al., 2022).
In this work, we propose that, similar to human beings, models can also answer intricate visual commonsense questions with related images (illustrated in Figure 1). To this end, we introduce IMAGENETVC, a unified zero- and few-shot visual commonsense benchmark incorporating multiple sources of images (e.g., ImageNet (Deng et al., 2009), search images, and synthetic images). From the data side, IMAGENETVC comprises 4,076 high-quality QA pairs, encompassing 1,000 ImageNet categories across diverse domains such as color, shape, material, component, etc. Moreover, as a human-annotated dataset, IMAGENETVC utilizes human visual perception to identify shared attributes across relevant images, avoiding textual bias and providing data that is more closely aligned with human knowledge. From the model side, besides unimodal LLMs, IMAGENETVC also enables the evaluation of various Visually-augmented Language Models (VaLMs) to investigate the effect of visual grounding, which compensates for the lack of images in previous benchmarks.
With IMAGENETVC, we conduct extensive evaluations on the most widely used LLMs and VaLMs. We benchmark the visual commonsense capabilities of various LLMs such as OPT, LLaMA, and Falcon, and assess the effect of visual grounding in VaLMs with multiple sources of relevant images.
We further analyze the confounding factors that may affect the visual commonsense capability of models, such as model scale, in-context learning, and image sources. We highlight several experimental findings below; these findings support the high value of our benchmark in assessing visual commonsense capabilities.
• Template-based datasets yield artificially inflated and unstable visual commonsense evaluation, while our manually constructed IMAGENETVC provides evidence that visual commonsense remains challenging for LLMs.
• We discover that the acquisition of visual commonsense is an emergent ability for LLMs. For instance, 1.3B parameters could be a potential threshold for unimodal LLMs to emerge with visual commonsense on the component sub-task.
• In-context learning enhances the understanding of visual commonsense tasks for both LLMs and VaLMs, not only reducing their variance across prompts but also calibrating the model confidence on visual commonsense.

Related Work
Large Language Models Text-only Large Language Models (LLMs) have exhibited outstanding performance across various textual commonsense tasks, benefiting from their training on extensive textual data (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, the lack of visual data (e.g., images) during pretraining restricts their visual commonsense capabilities (Li et al., 2023b). On the other hand, Visually-augmented Language Models (VaLMs) have gained popularity by integrating visual information into LLMs (Tsimpoukelli et al., 2021; Alayrac et al., 2022), which enhances the visual understanding capabilities of language models (Yang et al., 2022; Wang et al., 2022).
Visual Commonsense Evaluation Visual commonsense knowledge of visual concepts is a fundamental and critical aspect of AI systems seeking to comprehend and reason about the world (Yao et al., 2022; Dong et al., 2022). Previously, several datasets have been proposed to address specific attributes of visual commonsense, including MemoryColors (Norlund et al., 2021), ColorTerms (Bruni et al., 2012), RelativeSize (Bagherinezhad et al., 2016), and Spatial Commonsense (SpatialCS) (Liu et al., 2022). To evaluate general visual commonsense, Zhang et al. (2022a) introduced ViComTe, a template-based dataset consisting of various (subject, object) pairs (such as (sky, blue)). However, its reliance on pure textual input underestimates the visual capabilities of VaLMs. Furthermore, its utilization of template-based formats and automatic extraction techniques leads to substandard data quality and inherent textual biases.
In this work, we introduce IMAGENETVC, a human-annotated visual commonsense evaluation dataset that consists of 4K natural-language QA pairs across various visual attributes, which supports both LLM and VaLM evaluation with multiple sources of images. We present detailed comparisons of IMAGENETVC with prior work in Table 1.

IMAGENETVC
Starting from ImageNet, we construct our IMAGENETVC dataset in a multi-step crowd-sourcing pipeline, including 1) annotator training, 2) commonsense annotation, and 3) cross-check examination. An overall demonstration of our annotation process is illustrated in Figure 2.

Image Source
We selected ImageNet (Deng et al., 2009) as our image source because it covers a large number of commonly used objects in real-life situations, providing a diverse and representative image source. Additionally, the unified image format in ImageNet, with dimensions of 256×256 pixels, facilitates annotators' understanding of images and reduces feature engineering. Specifically, we used the widely-used ImageNet (ILSVRC) 2012 subset, consisting of 1.4 million images from 1,000 object categories.

Prerequisite: Annotator Training
We posted online job listings on Amazon Mechanical Turk and received over 500 applications from candidates with Bachelor's degrees or higher. To ensure dataset quality, we provided training with instructions and guidelines, along with a quick quiz to assess candidate understanding. Only candidates with scores higher than 95% were hired.

Phase 1: Commonsense Annotation
Figure 2 shows the commonsense annotation phase, where annotators are provided with category names and 50 randomly sampled images per category. To ensure that the QA pairs reflect visual commonsense rather than visual information tailored to specific images, annotators are instructed to focus on the visual features of each category rather than individual images. They are also provided with annotation examples and guidelines for rejection. The annotation UI and specifications for annotation can be found in Appendix A.

Phase 2: Cross-Check Examination
The primary objective of the cross-check examination phase is to conduct a rigorous screening and categorization of high-quality QA pairs that meet our requirements. This phase comprises two stages. In Stage 1, a category-level examination is performed, where three examiners are assigned to all annotated QA pairs in the same category. They are required to check all the pairs in the category based on the annotation instructions, rectify any grammatical errors, and eliminate low-quality or noncompliant pairs. Only the QA pairs that all three examiners approve are deemed acceptable. Stage 2 involves a sample-level examination. Although the examination in Stage 1 is efficient, examining all QAs in one category simultaneously creates a misalignment with the final testing method (one-by-one QA) and introduces a distribution bias in the examination process. Therefore, in Stage 2, a more thorough sample-level examination is carried out. Three examiners are randomly assigned a QA pair with the corresponding category name from the entire dataset. They vote on whether to accept the QA pair and classify it into one of the following five subsets: color, shape, material, component, and others. Only samples that receive a majority vote are approved for acceptance.
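The two acceptance rules above can be sketched as follows (a minimal illustration with our own helper names, not the authors' actual tooling):

```python
from collections import Counter

SUBSETS = ["color", "shape", "material", "component", "others"]

def stage1_accept(approvals):
    """Stage 1 (category level): a QA pair is kept only if all
    three examiners approve it."""
    return len(approvals) == 3 and all(approvals)

def stage2_accept(votes):
    """Stage 2 (sample level): three examiners each vote for a
    subset label or 'reject'; the pair is accepted under the label
    that wins a majority, and discarded otherwise."""
    label, count = Counter(votes).most_common(1)[0]
    return label if label in SUBSETS and count >= 2 else None
```

For example, `stage2_accept(["color", "color", "reject"])` keeps the pair under the color subset, while a three-way split yields no majority and the pair is dropped.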
Our 60-day annotation effort yielded a dataset of 4,076 items (refer to Table 2) from 1,000 ImageNet categories. It consists of 5 individual sub-tasks: color, shape, material, component, and others. More information and examples of IMAGENETVC can be found in Appendix B. All pricing strategy details and the hierarchical supervision process employed are elaborated in Appendix A.3 and A.4.

Dataset Evaluation
Unlike previous datasets, which are template-based, IMAGENETVC is derived from diverse real images associated with human-annotated descriptions, which can better represent real-world settings. To assess the strengths of our dataset, we conduct both automatic evaluation and human evaluation in this section.
First, we evaluate GPT-Neo-1.3B (Black et al., 2021) on the respective subsets of IMAGENETVC and ViComTe, a widely-used dataset, across different prompts. Results in Figure 3 indicate that, as a template-based dataset, ViComTe exhibits severe prompt bias, with substantial evaluation variance across different prompts. For example, the model achieves only 2% accuracy with the prompt "X is of shape Y" but achieves a 61% score with "The shape of the X is Y". Besides, compared with ViComTe, IMAGENETVC, which contains region-based questions, is more challenging for models. For example, with a suitably selected prompt on the color subset, the model achieves 40% accuracy on ViComTe but only 28% accuracy on IMAGENETVC.
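The prompt-sensitivity comparison can be summarized with a small helper like the one below (our own sketch; the numbers echo the shape-subset example from the text and are not the full experimental results):

```python
from statistics import mean, stdev

def prompt_sensitivity(acc_by_prompt):
    """Summarize how unstable an evaluation is across prompt
    templates via the mean, standard deviation, and max-min
    spread of per-prompt accuracies."""
    accs = list(acc_by_prompt.values())
    return {"mean": mean(accs), "stdev": stdev(accs),
            "spread": max(accs) - min(accs)}

# Illustrative numbers for ViComTe's shape subset, as cited in the text:
report = prompt_sensitivity({
    "X is of shape Y": 0.02,
    "The shape of the X is Y": 0.61,
})
```

A large `spread` (here 0.59) is the symptom of template bias: the same model looks very strong or very weak depending purely on prompt wording.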
We further conducted a human assessment comparing ViComTe, IMAGENETVC, and QA pairs automatically generated by ChatGPT. Specifically, we provided human annotators with sampled data from the two comparison datasets and asked them to vote for the better one considering diversity, difficulty, and factuality. As depicted in Figure 4, in more than 84% of cases, IMAGENETVC outperforms or matches the template-based ViComTe in terms of diversity and difficulty, which is consistent with the results in Figure 3. Moreover, our dataset demonstrates notably higher factual correctness than the data automatically generated by ChatGPT in more than 81% of cases.
To sum up, our data collection process ensures high-quality annotations with minimal bias and increased diversity, difficulty, and factuality, providing a challenging dataset for advancing research in visual commonsense understanding.

Experiments
Our experiments primarily focus on two types of language models: LLMs and VaLMs. Both have demonstrated promising capabilities in understanding visual information (Li et al., 2023b).

Visually-augmented Language Models
In our experiments, we mainly evaluate three widely-used open-source VaLMs: Z-LaVI (Yang et al., 2022), BLIP-2 (Li et al., 2023a), and MAGMA (Eichenberg et al., 2022). These VaLMs are mainly built on top of frozen LLMs and incorporate diverse mechanisms to integrate visual information. Further model details are provided in Appendix C.4.

Evaluation Methods
In this work, we focus on evaluating the zero- and few-shot visual commonsense of LLMs and VaLMs on IMAGENETVC. Following Schick and Schütze (2021) and Yang et al. (2022), we treat the zero-shot evaluation as a cloze test, transforming the QA pairs in IMAGENETVC into prompts like "[Question] The answer is [Answer]." Formally, each QA pair is converted into a sequence of tokens x = {x_0, ..., x_i, ..., x_n}, in which x_i is the answer. In the few-shot setting, examples with the same prompt are concatenated before each QA pair.
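The cloze transformation and few-shot concatenation can be sketched as follows (an illustrative helper, assuming the "[Question] The answer is [Answer]." template from the text):

```python
def build_cloze_prompt(question, demos=()):
    """Turn an ImageNetVC question into the cloze format
    '[Question] The answer is [Answer].' used for scoring; in the
    few-shot setting, completed (question, answer) examples are
    concatenated before the query."""
    prefix = " ".join(f"{q} The answer is {a}." for q, a in demos)
    query = f"{question} The answer is"
    return f"{prefix} {query}" if prefix else query

# One-shot example (the demo QA pair here is hypothetical):
p = build_cloze_prompt(
    "What is the color of a Samoyed's body?",
    demos=[("What color is the sky?", "blue")],
)
```

The model is then scored on how likely each answer candidate is to follow the trailing "The answer is".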

LLM Evaluation
Given an LLM M, the sequence of input tokens x = {x_0, ..., x_n} is first mapped to text embeddings e_t = {e_t(x_0), ..., e_t(x_i), ..., e_t(x_n)} by the embedding layer e_t ∈ M. Then we utilize the model to calculate the score for each answer candidate y ∈ Y:

    s(y | x) = P_{M′}(x_i = y | e_t),

where M′ denotes the transformer neural network in M and P(·) is the output probability given by the model. Then we obtain a probability distribution over all answer candidates using softmax:

    P(y | x) = exp(s(y | x)) / Σ_{y′∈Y} exp(s(y′ | x)).    (1)

We calibrate the prediction by normalizing the probability distribution following Zhao et al. (2021), to mitigate the bias introduced by prompt formats as well as few-shot examples. (All the prompts utilized for the evaluation of LLMs and VaLMs are shown in Appendix C.1.)
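A minimal sketch of this candidate-scoring step, assuming per-candidate scores (e.g., log-probabilities) have already been obtained from the model; the calibration here follows the spirit of Zhao et al. (2021) (subtracting content-free scores in log space), though the exact form is our assumption, not the authors' implementation:

```python
import math

def candidate_distribution(scores, calib=None):
    """Softmax over answer-candidate scores (Eq. 1 in the text).
    Optionally subtract content-free calibration scores in log
    space before renormalizing (illustrative calibration form)."""
    logits = dict(scores)
    if calib:
        logits = {y: s - calib[y] for y, s in logits.items()}
    m = max(logits.values())  # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Scores stand in for the model's log-probabilities of each candidate:
dist = candidate_distribution({"white": -1.0, "black": -3.0})
```

The returned distribution sums to one over the candidate set, so the prediction is simply the argmax over candidates.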

VaLM Evaluation
We incorporate two types of image sources as additional visual inputs for evaluating VaLMs: images retrieved from the web and synthetic images. We adopt Google Image Search to retrieve relevant images and Stable Diffusion (Rombach et al., 2022) for image synthesis. Following Yang et al. (2022), for each QA pair we utilize CLIP (Radford et al., 2021) to sort images from these two sources based on their similarity with the question and then preserve the top-K images as the final image sources. We mainly evaluate two types of VaLMs: prefix-based VaLMs and ensemble-based VaLMs.
Prefix-based VaLMs Given a QA pair with an image v, prefix-based VaLMs (e.g., BLIP-2 and MAGMA) first utilize a visual encoder to transform the image into a sequence of visual embeddings e_v = {e_v^1, ..., e_v^m}. Then, these embeddings are prefixed to the text embeddings of x and fed into the frozen LLM backbone to calculate the score:

    s(y | x, v) = P_{M′}(x_i = y | e_v, e_t).

The probability distribution given the image v is calculated over all answer candidates in the same way as Eq. (1). If K images are provided, the final distribution is averaged over all images:

    P(y | x) = (1/K) Σ_{k=1}^{K} P(y | x, v^(k)).    (2)

In the few-shot setting, L examples {v^(j), x^(j)}_{j=1}^{L} with the same processing are concatenated in front of each QA pair.
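The averaging over the K retained images reduces to a simple mean of per-image answer distributions, as in this sketch (the two distributions below are hypothetical):

```python
def average_over_images(per_image_dists):
    """Eq. (2): average a prefix-based VaLM's per-image answer
    distributions over the K retained images."""
    k = len(per_image_dists)
    answers = per_image_dists[0].keys()
    return {y: sum(d[y] for d in per_image_dists) / k for y in answers}

# Two hypothetical per-image distributions over the same candidates:
avg = average_over_images([
    {"white": 0.9, "black": 0.1},
    {"white": 0.7, "black": 0.3},
])
```

Averaging over several related images is what lets the model answer about the category rather than any one picture.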
Since prefix-based VaLMs utilize frozen LLM backbones, they can be regarded as a conditional extension of text-only LLMs. Evaluations across these two model types facilitate a thorough assessment of the effect of visual grounding on visual commonsense ability.
Ensemble-based VaLMs Given the input tokens x and multiple images v = {v^(i)}_{i=1}^{K}, ensemble-based VaLMs (e.g., Z-LaVI) utilize a frozen CLIP model, which contains a text encoder f_t and a visual encoder f_v, to project the tokens x and the image v^(i) into a shared representation space and compute the relevance score between them:

    s(y | x, v^(i)) = f_t(x_y) · f_v(v^(i)),

where x_y denotes the input sequence with the answer slot filled by candidate y. Then, in the same way as Eq. (1) and Eq. (2), the probability distribution over all answer candidates and across the K images is obtained:

    P_CLIP(y | x) = (1/K) Σ_{i=1}^{K} softmax(s(y | x, v^(i))),

where softmax(·) is a simplified denotation of Eq. (1). The final ensemble score is calculated as a weighted sum over the output distributions of the LLM and CLIP:

    P(y | x) = (1 − w) · P_LLM(y | x) + w · P_CLIP(y | x),

where w denotes the weight hyperparameter.
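The final weighted sum is straightforward; the sketch below combines two hypothetical candidate distributions with weight w:

```python
def ensemble(llm_dist, clip_dist, w=0.5):
    """Final ensemble: a weighted sum of the text-only LLM
    distribution and the CLIP-based distribution, with weight
    hyperparameter w."""
    return {y: (1 - w) * llm_dist[y] + w * clip_dist[y] for y in llm_dist}

# Hypothetical distributions over two candidates:
final = ensemble({"white": 0.6, "black": 0.4},
                 {"white": 0.9, "black": 0.1}, w=0.4)
```

Since both inputs are proper distributions over the same candidates, the weighted sum is again a distribution, so no renormalization is needed.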

Experimental Details
We adopt Google Image Search to retrieve relevant images and utilize the newly released Stable Diffusion (Rombach et al., 2022) for image synthesis. Following Yang et al. (2022), for each QA pair in IMAGENETVC, we obtain 100 images with each of the two methods. These 200 images are sorted by CLIP based on their similarity with the question, and we preserve the top-10 (K = 10) images for each QA pair as the final image sources. Other experimental details, such as the model implementation, hyperparameters, and computing resources, are presented in Appendix C.3.
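The ranking-and-truncation step amounts to a similarity sort, as in this sketch (the similarity scores in the paper come from CLIP; here a plain dict of made-up values stands in for them):

```python
def rank_images(images, similarity, k=10):
    """Sort candidate images by their similarity to the question
    and keep the top k, mirroring the CLIP-based ranking described
    in the text (similarity is a stand-in for CLIP scores)."""
    ranked = sorted(images, key=lambda im: similarity[im], reverse=True)
    return ranked[:k]

# 20 hypothetical images with made-up similarity scores:
sims = {f"img{i}": i / 10 for i in range(20)}
top = rank_images(list(sims), sims, k=3)
```

In the actual pipeline, the 200 retrieved and synthesized images per QA pair are reduced to the 10 most question-relevant ones this way.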

Main Results
The main evaluation results of LLMs and VaLMs on IMAGENETVC are shown in Figure 5. Here, we highlight several interesting findings. Falcon and LLaMA excel among the four presented LLM model families, especially on the color and component sub-tasks. As shown in Figure 5(a, b), Falcon and LLaMA consistently outperform OPT and GPT across various subsets in both experimental settings, except on the shape subset. In particular, LLaMA achieves a zero-shot accuracy of 41% on the color subset, surpassing GPT-J by a considerable margin of 13%, and Falcon yields the highest few-shot accuracy of 76% on the component subset, a 14% absolute improvement over OPT. We further present the few-shot results of the largest available LLMs in their respective model families in Figure 5(c), where LLaMA-65B shows remarkable superiority over the other counterparts.
In-context learning (ICL) not only improves the visual commonsense performance of LLMs but also reduces their variance across different prompts. Comparing the results in Figures 5(a) and 5(b), we find that given a few examples (i.e., with ICL), LLMs achieve consistent and remarkable improvements over themselves. For instance, LLaMA-7B with ICL achieves an average score of 62% across the five sub-tasks, a 12% improvement over its zero-shot result. We further show the performance distribution of LLaMA across different prompts in Figure 6, which illustrates that ICL not only improves the model's performance but also reduces its variance across different prompts. Further analysis is conducted in Section 5.
VaLMs improve the visual commonsense ability of their LLM backbones, despite small performance gains on the shape subset. As depicted in Figure 5(d, e), BLIP-2 shows remarkable superiority over OPT on the color and material subsets, with average accuracy improvements of 17% and 15%, respectively, which indicates that incorporating visual information indeed helps to improve LLMs' visual commonsense capabilities. However, the results also show that the performance gains of VaLMs are small on some sub-tasks: both BLIP-2 and MAGMA achieve only a 0.5% accuracy improvement on the shape sub-task, while Z-LaVI even exhibits performance drops. This demonstrates that VaLMs still have wide room for improvement.
The ICL capability of VaLMs should be further valued. As shown in Figure 5(f), MAGMA with ICL achieves consistent improvements across all subsets over both itself and the few-shot results of its frozen LLM backbone, indicating that ICL can also improve VaLMs' visual commonsense performance. However, ICL has been somewhat under-investigated in previous VaLM research. For example, both Z-LaVI and BLIP-2 only support zero-shot evaluation, lacking ICL capability. We hope our research draws more attention to the ICL capability of VaLMs in future work.

Analysis
We further investigate the factors influencing the visual commonsense capabilities of LLMs and VaLMs. For instance, we find that a decent scale (e.g., 1.3B parameters) could be a potential threshold for text-only LLMs to learn visual commonsense. We then analyze several influencing factors for VaLMs, such as image sources and the number of images.

When (at what model scale) do text-only LLMs learn visual commonsense?
We show the zero- and few-shot results of three LLM families on the component subset in Figure 7. Taking the component sub-task as an example, we find that a decent scale (e.g., 1.3B) could be a starting point for LLMs to emerge with visual commonsense on the component subset: smaller models at sizes below 1.3B are unable to perform well on the task, with performance close to random guessing (i.e., ~50% accuracy), while models larger than 1.3B exhibit gradual performance improvements. For example, OPT-30B achieves 59% and 66% average accuracy in the zero- and few-shot settings, respectively.
What is the effect of ICL on the calibration of LLMs? Ensuring the reliability of a model's predictions is a crucial aspect of model evaluation, as miscalibrated models can lead to incorrect inferences and serious consequences in real-world applications (Guo et al., 2017). To this end, we conduct a calibration analysis to evaluate whether the model's confidence on visual commonsense reliably reflects the actual probability of its predictions being correct. Our analysis focuses on the calibration of LLaMA-7B on the component subset of IMAGENETVC. Results in Figure 8 indicate that ICL significantly improves the calibration of the model, increasing the correlation between confidence and accuracy from r = 0.57 to r = 0.98. This is consistent with the aforementioned findings, suggesting that ICL improves the visual commonsense performance of LLMs.
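This kind of calibration analysis can be approximated with the helpers below (our own sketch; the confidence-binning scheme is an assumption, not necessarily the authors' exact procedure):

```python
def pearson(xs, ys):
    """Pearson correlation, e.g., between per-bin confidence and
    per-bin accuracy (the r values reported in the text)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def reliability_bins(confs, correct, n_bins=5):
    """Bucket predictions by confidence; for each non-empty bucket,
    return (mean confidence, empirical accuracy). A well-calibrated
    model keeps the two values close in every bucket."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    out = []
    for b in bins:
        if b:
            cs, oks = zip(*b)
            out.append((sum(cs) / len(cs), sum(oks) / len(oks)))
    return out
```

A correlation near 1 between the two columns returned by `reliability_bins` corresponds to the well-calibrated r = 0.98 regime reported for ICL.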
Note that, as the evaluated LLMs (OPT and Pythia) both rely on the Pile (Gao et al., 2021) as their pre-training corpus, the scaling findings above may not generalize to other LLMs.

How do image sources influence VaLMs? Table 3 shows the ablation results for the image sources used in VaLMs. As illustrated, providing BLIP-2 with extra image sources (e.g., SEARCH) brings large improvements. This supports IMAGENETVC's motivation, suggesting that previous visual commonsense evaluations undervalue VaLMs' potential, as VaLMs' visual commonsense requires relevant images as input to be suitably stimulated. Among the various image sources, CLIP-ranked images yield the best performance, suggesting that aligning images closely with the question helps models recall the related visual commonsense. Thus, we use ranked images as the default image source for our main experiments and analysis.
What is the typical number of images required to capture visual commonsense? We show the models' average performance on IMAGENETVC with various numbers of images in Figure 9. The results show that performance increases as the number of top-ranked images grows from 1 to 10, indicating that diverse image sources help the model capture general visual commonsense. However, the improvement is marginal once the number of images exceeds 10.

Visual Commonsense in Other Models
It is worth noting that, as a general visual commonsense dataset, IMAGENETVC supports various types of models and evaluation settings. Beyond the evaluation setting in our main results, we also evaluate several models in the setting of open-ended generation. Specifically, we select two widely-used multimodal models, 12-in-1 (Lu et al., 2020) and BLIP (Li et al., 2022), which are finetuned on the VQAv2 dataset (Goyal et al., 2017), and the well-known RLHF model ChatGPT (gpt-3.5-turbo) for evaluation; the evaluation details are illustrated in Appendix E. As illustrated in Table 4, the multimodal models finetuned on VQAv2 show strong performance on IMAGENETVC, especially on the color sub-task, despite their relatively small model scales (e.g., 583M parameters for BLIP). ChatGPT with ICL achieves the best average accuracy of 75.8% across all compared models. However, it still exhibits a considerable performance gap with humans, who achieve an average performance of 93.5%.

Conclusion
In this paper, we introduced IMAGENETVC, a comprehensive human-annotated dataset for evaluating visual commonsense using both textual and visual inputs. We conducted extensive experiments to evaluate the visual commonsense of both unimodal LLMs and VaLMs using IMAGENETVC.
Our results demonstrate the varying degrees of visual commonsense knowledge present in different models, as well as the factors that contribute to the acquisition and enhancement of this knowledge. Additionally, we offer insights into the emergent abilities of LLMs and the strengths of VaLMs in the realm of visual commonsense.

Limitations
While our study provides valuable resources and insights into the visual commonsense knowledge of LLMs and VaLMs, several limitations need to be acknowledged. First, due to the high cost of region-based human annotation, the IMAGENETVC dataset only covers 1,000 ImageNet categories, which may not cover all real-world scenarios. Therefore, models may perform differently on images outside of the IMAGENETVC categories. Additionally, our study is limited to zero- and few-shot visual commonsense evaluation and only considers models that have been pretrained on large amounts of text data. Thus, it remains unclear whether fine-tuning on visual commonsense tasks would improve the performance of LLMs and VaLMs, and our evaluation may not fully capture a model's ability to understand visual commonsense in real-world scenarios where prior knowledge may be available. Besides, although we explored the factors that affect the visual commonsense knowledge of large models, it remains challenging to interpret how the models acquire this knowledge.
Overall, our study provides a foundation for future work in the field of visual commonsense knowledge and its applications.However, additional research is necessary to address the aforementioned limitations and further advance the field.

Appendix A Annotation Details
In this section, we will provide a comprehensive overview of our annotation process, including the guidelines we follow, the user interface we use, the hierarchical supervision process we employ to ensure data quality, and our payment policy.

A.1 Annotation Guidelines
The annotation of IMAGENETVC involves observing 20-50 images of a given category, finding a vision feature of the category, checking if it conforms to most of the images and our commonsense of life, and then writing a simple question-answer (QA) about this vision feature.The QA should contain one question and one correct answer.
The vision features can be object-based (such as the color, shape, material, and spotted/striped patterns of the whole object) or region-based (such as the color, shape, and material of a certain part of the object).They are features that can be seen through the images.
The annotation pipeline involves looking at the 20-50 given images and finding a common vision feature of the category, for example, "The shape of the dorsal fin of the tiger shark is triangle". The annotators check whether this feature conforms to their commonsense of life and whether it has already been written. Then, one QA is created, such as "What is the shape of the dorsal fin of the tiger shark? Triangle".
The following rules must be followed during the annotation process:
• The question should contain the name of the category. Otherwise, the submission will not be passed.
• If the annotator cannot think of a question that can be written, or the images cannot be displayed, the annotator can skip the category.
• The first letter of the question needs to be capitalized.
• The end of the question needs to be a question mark.
• Please do not write lots of Yes/No questions. These questions are more likely to be rejected. We encourage the annotators to write more diverse answers.
In the annotation examples, we describe how a correct QA is created. Annotators can write their own QAs according to this pipeline. The rejected examples include cases where the QA has been written before, the vision feature cannot be found in the images, or the QA is not about a vision feature. Additionally, the QA should conform to our commonsense of life, be strongly related to the category, and be about a specific feature of the category.
In summary, the annotation guidelines of IMAGENETVC involve observing images of a category, finding a vision feature, and creating a QA about that feature that conforms to our commonsense of life and follows the rules outlined above. These guidelines ensure the accuracy and consistency of the annotations, making the dataset suitable for use in various applications.

A.2 Annotation UI
Figure 11 shows the annotation user interface used in our human annotation process for model knowledge assessment.The interface consists of three main parts: the task instruction, the annotation pipeline, and the most common cases we reject.The task instruction provides clear guidance for the annotators on how to write effective prompts to assess the model knowledge.The annotation pipeline displays the generated text by the model and allows the annotators to refine their prompts until the generated text matches the expected target answer.The rejected cases section provides examples of prompts that do not meet the criteria and serves as a reference for the annotators to avoid such mistakes.The user interface design is intuitive and user-friendly, which greatly improves the efficiency and accuracy of the human annotation process.

A.3 Hierarchical Supervision
To ensure high-quality annotation, we implemented a hierarchical supervision process. During the annotation phase, examiners cross-check the annotation results, and annotators whose submissions are rejected too often receive a warning. Those who exceed the warning limit are removed. Additionally, during the examination phase, a random sample check of the examination results is performed by five authors, and examiners whose checks are of low quality also receive a warning. This hierarchical supervision process guarantees the high-quality execution of the entire annotation process.

A.4 Payment Policy
We compensated the crowd workers at varying rates based on the workload and quality of their work.

B Details of IMAGENETVC B.1 Details of the Others subset
Annotated QA pairs that do not belong to the four specified sub-tasks (i.e., color, shape, material, and component) are categorized into the Others subset. Therefore, the Others subset contains a more diverse range of QA samples and is more challenging. Figure 10 illustrates the detailed composition of QA types in the Others subset, which covers various topics such as length comparison (21%), relative size (20%), living environment (16%), counting (12%), etc.

B.2 Answer Set
Considering that the results of open-ended generation are uncontrollable, we evaluate all models with constrained decoding in our main experiments. Table 5 shows the list of all possible answers in IMAGENETVC. Besides, we noticed that LLMs tend to predict "yes/no" or numerical answers when evaluated on the Others subset. Thus, we split the Others subset into three small test sets, containing answer types of "yes/no", numbers, and other answers, respectively.

B.3 More Qualitative Examples
We present additional qualitative examples in Table 6, which compare the predictions made by various models. The comparisons between OPT-7B and BLIP-2 demonstrate the effectiveness of incorporating visual information in enhancing the visual commonsense capabilities of LLMs. However, even the leading models, including ChatGPT, encounter difficulties in certain challenging cases, such as determining the color of a flamingo's beak tip. Besides, these examples highlight ChatGPT's tendency to prioritize the most commonly associated property of an object as the answer, rather than considering the properties of the specific region in question. We hypothesize that this behavior may be attributed to the higher frequency with which these common attributes co-occur with the object in the pre-training text corpus.

C Experimental Details C.1 Multiple Prompts
Table 7 shows all the intuitive prompt templates utilized to evaluate the LLMs and VaLMs in our main experiments. We do not tune the prompts for each subset of IMAGENETVC.
In our data quality experiments, we utilize the original prompts provided by Zhang et al. (2022a) for ViComTe evaluation, while adopting the prompts in Table 7 for IMAGENETVC.

C.2 Details of Human Assessment
In the human assessment of ImageNetVC, annotators were presented each time with 32 randomly sampled, mutually exclusive instance pairs drawn from ImageNetVC and a compared dataset (e.g., ViComTe). Throughout the comparison evaluation, we conducted multiple rounds of human assessment, with the number of rounds determined by the dataset containing fewer test samples. Specifically, 556 comparison pairs were used for the evaluation between ImageNetVC and ViComTe, and 510 pairs for the evaluation between ImageNetVC and ChatGPT-generated data.
Annotators were instructed to evaluate these instances based on the overall quality of the data, choosing the better side with respect to three factors: diversity, difficulty, and factuality. The final results were computed and reported as percentages in Figure 4. For instance, the Diversity score in Figure 4(a) signifies that in 86% of cases, the evaluators found ImageNetVC samples to either outperform or match those from ViComTe.

Table 10: Zero-shot probing results of visually-augmented language models (VaLMs) on IMAGENETVC. We report the mean accuracy (%) over 5 different prompts. Numbers highlighted in orange represent the percentage of improvement, and blue denotes the percentage of performance drop.

Prompt 1
Answer List: [CANDIDATES]
[QUESTION] Please select the most possible answer from the above list. Please answer in one word.

Prompt 2
Answer List: [CANDIDATES]
[QUESTION] Please only print the answer selected in the above list. Please answer in one word.

Prompt 3
[QUESTION] Please select the most possible answer from [CANDIDATES]. Please answer in one word.
Table 11: The prompts we utilize for ChatGPT evaluation on IMAGENETVC. "[CANDIDATES]" denotes the answer set of the evaluated subset in IMAGENETVC, as shown in Table 5.
We list the details of the VaLMs evaluated on IMAGENETVC in Table 9, including the extra model parameters (beyond the frozen LLM backbone), the architecture of the visual encoder, and the number of pretraining images. Since the VaLMs vary in implementation details (e.g., the visual encoder), we cannot make a direct (head-to-head) comparison between them and leave this for future investigation.

D Details of Main Experimental Results
We provide the detailed numbers of our main experimental results with LLMs and VaLMs in Tables 8 and 10, respectively. For LLMs, we show the zero- and few-shot evaluation results of OPT, GPT, Pythia, Falcon, and LLaMA across various model scales. For VaLMs, we compare the performance of Z-LaVI, BLIP-2, and MAGMA with that of their frozen LLM backbones.

E Evaluation Details of Other Models
This section provides evaluation details of VQA finetuned multimodal models and RLHF models.
Multimodal Models For multimodal models such as BLIP, we adopt the evaluation settings used in VQAv2 (Goyal et al., 2017). Specifically, for each QA pair and its corresponding image, we evaluate the model with open-ended generation and obtain the output answer. Following the experimental settings outlined in Section 4.1.2, we provide the top-10 ranked images for each QA pair and determine the final answer by majority prediction.
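Majority prediction here means generating an answer for each of the top-ranked images independently and keeping the most frequent one. A minimal sketch, assuming answers have already been generated per image (the example answers are illustrative, not actual model outputs):

```python
from collections import Counter

def majority_answer(per_image_answers):
    """Aggregate the open-ended answers generated for each retrieved
    image into one final answer by majority vote (after normalizing
    case and surrounding whitespace)."""
    counts = Counter(a.strip().lower() for a in per_image_answers)
    return counts.most_common(1)[0][0]

# e.g., answers produced for the 10 top-ranked images of one QA pair
final = majority_answer(
    ["white", "white", "cream", "white", "white",
     "gray", "white", "white", "cream", "white"]
)
print(final)  # white
```

Voting over several images makes the prediction robust to a single atypical image (e.g., one photo taken under unusual lighting).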

RLHF Models
We evaluate ChatGPT with constrained prompts and automatically compute the top-1 accuracy.The prompts utilized are presented in Table 11.

Figure 1 :
Figure 1: Illustration of Visual Commonsense. Visual commonsense refers to the general visual knowledge that is commonly shared across the world, as opposed to the visual information that is specific to a single image. Visual commonsense can be captured through a series of related images.

Example questions from Figure 1: "Is the dog in the picture smiling or not?"; "What shape are a Samoyed's ears?"; "Do Samoyeds have spots on their backs?"

Figure 2 :
Figure 2: An overall demonstration of the construction procedures of IMAGENETVC.

Figure 4 :
Figure 4: Human assessment of visual commonsense datasets from three aspects: diversity, difficulty, and factuality. IMAGENETVC outperforms ViComTe in terms of diversity and difficulty, while also demonstrating superior factuality compared to ChatGPT-generated data.
Figure 5: Radar plots for five individual sub-tasks in IMAGENETVC. We show evaluation results under four experimental settings: (a, b) zero- and few-shot evaluation with LLMs-7B; (c) few-shot evaluation with the best LLMs in their own model family; (d, e) zero-shot evaluation with VaLMs and their frozen LLM backbones; (f) few-shot evaluation with VaLMs. The numbers along the radial axis denote the mean Top-1 accuracy (%) of models over 5 different prompts. The detailed results for drawing these plots are shown in Appendix D.

Figure 6 :
Figure 6: Performance distribution of LLaMA on the color subset. Models with ICL achieve higher performance and show reduced variance across prompts.

Figure 8 :
Figure 8: Calibration results of LLaMA-7B on the component subset. ICL greatly enhances model calibration, significantly boosting the correlation between confidence and accuracy from r = 0.57 to r = 0.98.

Figure 10 :
Figure 10: Detailed composition of the Others subset.

Figure 11 :
Figure 11: Screenshot of the IMAGENETVC annotation UI, featuring task instructions, the annotation pipeline, and the most common reasons for rejecting prompts during the annotation process. The interface displays 20-50 images of a given category, and the task instruction guides the annotator to identify a common visual feature and create a simple QA about it. The annotation pipeline includes checking conformity to commonsense and avoiding pre-written QAs. Annotations are focused on visual features, not general commonsense or non-visual attributes.

Table 1 :
Features and statistical information of ImageNetVC and prior related datasets. The '# Category' column indicates the number of object categories included, and '# Test' means the number of test samples in the dataset.
represent mean results; black diamonds represent outliers. Compared to the template-based dataset, ViComTe, IMAGENETVC demonstrates notably reduced evaluation variance across various prompts.

Table 4 :
Evaluation results of multimodal models and ChatGPT on IMAGENETVC. We report Top-1 accuracy results obtained by open-ended generation. We also show human performance in the last row as a reference.

Table 5 :
The answer set of all subsets in IMAGENETVC.Inside the parentheses are attributes grouped into the same answer candidate.

Table 7 :
The prompts we utilize for LLM and VaLM evaluation on IMAGENETVC.