Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Pre-trained vision and language models have demonstrated state-of-the-art capabilities over existing tasks involving images and texts, including visual question answering. However, it remains unclear whether these models possess the capability to answer questions that not only query visual content but are also knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits the fine-grained knowledge that models learned during pre-training. Furthermore, we show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents, indicating a significant space for improvement.


Introduction
Large language models acquire knowledge during pre-training (Brown et al., 2020; Chowdhery et al., 2022), as demonstrated by their emergent ability to answer information-seeking questions in the open world, where the questioner does not have easy access to the information. While prior works have analyzed models' capabilities to answer textual information-seeking (or info-seeking) questions, much less is known about visual info-seeking questions. For example, after taking a picture of the specific church in Figure 1, a person might want to know the date of construction, or who decorated the interior of the church. Although the entity is present in the image (the specific church), the relevant knowledge (e.g., the date) is not. Given recent advances in pre-trained vision and language models (Alayrac et al., 2022; Chen et al., 2023b; Li et al., 2023), do these models also understand how to answer visual information-seeking questions?
To study this research question, a visual question answering (VQA) dataset focusing on info-seeking questions is required. However, not all VQA datasets meet this criterion. For example, by design, the majority of questions in datasets such as VQA v2 (Goyal et al., 2017) focus on visual attributes and object detection, and do not require information beyond the image to answer. While models capable of answering these types of questions have the potential to aid visually impaired individuals (Gurari et al., 2018), there is a broader class of info-seeking questions that cannot be easily answered even by sighted adults. Handling such questions (e.g., "When was this building constructed?" 1955) is critical, as they come closer to the natural distribution of human questions.
In this paper, we present INFOSEEK, a natural VQA dataset that focuses on visual info-seeking questions. Different from previous VQA datasets, the testing subset of INFOSEEK is collected in multiple stages from human annotators to evaluate VQA where the question cannot be answered from the visual content alone (see a comparison of datasets in § 2). In addition to this manually curated test set, which enables realistic evaluation of info-seeking VQA, we also join annotations from a recent visual entity recognition dataset (Hu et al., 2023) with the Wikidata database (Vrandečić and Krötzsch, 2014), and employ human annotators to write templates to semi-automatically generate a large corpus of visual info-seeking QA pairs. Over 1 million {image, question, answer} triplets are generated to support fine-tuning multimodal models for info-seeking VQA. We split the data to ensure that memorizing knowledge during fine-tuning is useless: models either have to learn to use knowledge acquired during pre-training, or learn to retrieve knowledge from an external knowledge base.
Using INFOSEEK, we analyze the ability of state-of-the-art models to answer visual info-seeking questions. We find that pre-trained vision-language models, both those pre-trained end-to-end (e.g., PaLI-X by Chen et al.) and those pre-trained with a frozen LLM (e.g., BLIP2 by Li et al.), struggle to answer info-seeking questions zero-shot, though BLIP2 outperforms PaLI-X by a margin. Surprisingly, after fine-tuning on our (large, semi-automatically curated) training set, PaLI-X yields a significant improvement and outperforms the fine-tuned BLIP2 models on queries that are unseen during fine-tuning. This suggests that while pre-trained PaLI-X has a significant amount of knowledge, it requires a small amount of fine-tuning data to fully awaken its capabilities. Furthermore, we show that INFOSEEK fine-tuned models can even generalize to questions and entity types completely unseen during fine-tuning (e.g., art & fashion).
When incorporating a visual entity recognition component and conditioning models on the Wikipedia articles of the relevant entities, we show that models accessing such a knowledge base (With-KB) perform better overall than those that rely on knowledge learned during pre-training. However, end-to-end (No-KB) models were found to be better on certain classes of questions that require coarse-grained answers ("Which continent is this building located on?"), even on tail entities. Our experiment (§ 5.2) further suggests that improving visual entity recognition can drastically increase models' capability in answering visual info-seeking questions (from 18% to 45.6%), indicating a promising direction for future development.
2 The Need for a New Visual Information-seeking Benchmark

While there have been plenty of knowledge-intensive VQA (KI-VQA) benchmarks, we show that none of these meet the criteria to effectively evaluate info-seeking VQA. Early efforts in this area, such as KBQA (Wang et al., 2015) and FVQA (Wang et al., 2017), were based on domain-specific knowledge graphs, while recent datasets like OK-VQA (Marino et al., 2019) and its variants such as S3VQA (Jain et al., 2021) and A-OKVQA (Schwenk et al., 2022) have improved upon this foundation by incorporating an open-domain approach and highlighting common-sense knowledge. Among the existing benchmarks, K-VQA (Sanket Shah and Talukdar, 2019) and ViQuAE (Lerner et al., 2022) are most relevant, but they have severe limitations in their question generation process, as discussed below.
Information-Seeking Intent. The evaluation of models' ability to answer info-seeking questions requires fine-grained knowledge, which a person is unlikely to know off the top of their head. However, we found that 70.8% of OK-VQA questions can be answered without the need to use a search engine, indicating that the dataset primarily focuses on knowledge that is commonly known to people. Most OK-VQA questions concern coarse-grained knowledge that many people already know: "What days might I most commonly go to this building?" Sunday. One only needs to know the building type (e.g., church) rather than the specific building (e.g., Dominus Flevit Church). This makes it unsuitable for evaluating pre-trained models on long-tailed knowledge, where these models have shown weaknesses (Kandpal et al., 2022).
Reliance on Visual Understanding. In contrast to OK-VQA, the ViQuAE dataset aims to test fine-grained knowledge of visual entities by pairing questions from TriviaQA (Joshi et al., 2017) with images. However, a significant portion of the ViQuAE questions (e.g., "Who betrayed him for 30 pieces of silver?") can be answered without looking at the images, as the questions often reveal sufficient information to determine the answer. Furthermore, K-VQA only focuses on human subjects, while over 43% of questions in ViQuAE revolve around human entities (see Table 2). Such limitations hinder the evaluation of a model's knowledge across various entity categories and may result in reduced task complexity, as the evaluation may be limited to mere facial recognition.
To address these limitations, we present INFOSEEK (§ 3), a new benchmark for evaluating pre-trained multimodal models on visual info-seeking questions. Our work builds on top of a visual entity recognition dataset, OVEN (Hu et al., 2023), which is designed to answer questions related to the identification of visual entities. We take this a step further by benchmarking info-seeking questions about visual entities, which allows us to test models' pre-training knowledge beyond simply recognizing an entity.

INFOSEEK: A VQA Benchmark of Visual Information-seeking Questions

To ensure INFOSEEK questions rely on visual understanding and to prevent models from taking shortcuts in the question without using the image, we employ a two-stage annotation approach inspired by TyDiQA (Clark et al., 2020). This makes it unlikely that questioners have prior knowledge of the answer (unlike SQuAD (Rajpurkar et al., 2016)), ensuring questions with genuine info-seeking intent (Lee et al., 2019).
Question Writing. Annotators are asked to write 3-5 questions about a visual entity based on their own curiosity and information needs. To aid the question-writing process, they are prompted with visual entity images, a short description (15 words) about the entity, and a list of Wikipedia section titles. This ensures that the questions reflect a genuine interest in learning about important aspects of the entity without seeing the answer. A set of annotation rules is employed to prevent trivial questions, such as questions about visual attributes.
Answer Labeling. For each entity, we randomly assign the collected questions to a different group of annotators to label answers from Wikipedia. Annotators were shown the Wikipedia article of the entity and asked to find a concise answer to the question: a text span that is as short as possible while still forming a satisfactory answer. In addition, annotators categorize questions into three types: TIME (e.g., year), NUMERICAL (e.g., height), and STRING (e.g., location).
Finally, we construct {image, question, answer} (IQA) triples by assigning images to the annotated QA pairs of a visual entity, followed by human verification and clarification of the questions if multiple objects are present in the image. Following TyDiQA (Clark et al., 2020), we measure the correctness of annotations and take the high accuracy (95%) as evidence that the quality of the dataset is reliable for evaluating visual info-seeking models.

INFOSEEK Wikidata: 1 Million Automated VQA Data from Wikipedia
Human annotation is valuable but costly for large-scale evaluation. We thus scale up the dataset using a semi-automated procedure, transforming knowledge triples in Wikidata (2022-10-03) into natural language questions with human-authored templates, resulting in 1.3M examples over 11K visual entities covering 2.7K entity types (see Table 2).
QA Generation. We convert knowledge triples (subj, relation, obj) in Wikidata into natural language question-answer pairs for a selected list of 300 relations. For each relation, annotators write one or two question templates, which contain a placeholder for a hypernym of the visual entity (e.g., car) and, for numerical questions, a placeholder for the unit of measurement (e.g., inches) to avoid ambiguity. Finally, we construct the IQA triples by pairing images of a visual entity with the corresponding QA pairs.

QA Pair Filtering and Subsampling. To ensure the questions are diverse and the answers can be referenced from Wikipedia, we filter out QA pairs whose Wikidata answers cannot be found in the Wikipedia article, and subsample questions to balance the distribution of entities and relations.
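As a concrete illustration of the template-based generation step above, the following is a minimal sketch; the template texts, relation IDs for inception and height, and data structures are illustrative stand-ins rather than the actual human-written templates.

```python
# Illustrative sketch of template-based QA generation from Wikidata triples.
# The templates and field layout are hypothetical; the real pipeline uses
# human-written templates for ~300 relations.
TEMPLATES = {
    "P571": "In which year was this {hypernym} founded or built?",   # inception
    "P2048": "What is the height of this {hypernym} in {unit}?",     # height
}

def generate_qa(triple, hypernym, unit=None):
    """triple: (subject_qid, relation_pid, object_value) from Wikidata."""
    _, relation, obj = triple
    question = TEMPLATES[relation].format(hypernym=hypernym, unit=unit)
    answer = f"{obj} {unit}" if unit else str(obj)
    return {"question": question, "answer": answer}

# Example usage (hypothetical values):
# generate_qa(("Q123", "P571", 1955), hypernym="church")
#   -> {"question": "In which year was this church founded or built?", "answer": "1955"}
```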

Evaluation of INFOSEEK
Dataset Split. We design the evaluation split to prevent overfitting to the training set and to focus on evaluating the generalization ability of pre-trained models. This includes the ability to answer questions about new entities and questions not seen during training. Particularly, we define two evaluation splits: (1) UNSEEN ENTITY, where a portion of entities are held out during training and only included in the evaluation; (2) UNSEEN QUESTION, where we hold out a portion of the QA pairs of seen entities for evaluation.

Evaluation Metric. Three types of questions are evaluated differently: VQA accuracy (Goyal et al., 2017) for STRING and TIME, and Relaxed Accuracy (Methani et al., 2020) for NUMERICAL. We applied different relaxing strategies for each question type, and averaged the accuracy for each question. Finally, we calculate the accuracy for each data split (UNSEEN QUESTION and UNSEEN ENTITY), and take the harmonic mean of them as the overall accuracy (see Appendix).

Protocols and Models for INFOSEEK
Motivated by previous research on text-based question answering benchmarks (Joshi et al., 2017; Roberts et al., 2020), we introduce two evaluation protocols, No-KB and With-KB, which differ in the information made available to models (see Table 3 and Figure 2). Under the No-KB protocol, we evaluate the following pre-trained models.

BLIP2 & InstructBLIP. We utilize two pre-trained vision-language models, BLIP2 (Li et al., 2023) and InstructBLIP (Dai et al., 2023). Both models share the same architecture, which trains a Q-Former Transformer that connects a frozen vision encoder (ViT-g/14) to a frozen instruction-tuned language model (Flan-T5 XXL (Chung et al., 2022)) to output text based on an input image and text. In particular, InstructBLIP fine-tunes the BLIP2 model on 26 vision-language datasets (e.g., VQAv2, OK-VQA) with a text instruction prefix, and is reported to show improved zero-shot performance on unseen vision-language tasks. Following Li et al. (2023), we fine-tune the Q-Former of both models on INFOSEEK Wikidata for improved performance.

PaLI-17B & PaLI-X. We experiment with two additional pre-trained vision-language models from the PaLI (Chen et al., 2023b,a) family, given their SOTA performance. In particular, we use PaLI-17B (ViT-e + mT5 XXL (Xue et al., 2020)) and PaLI-X (ViT-22B (Dehghani et al., 2023) + UL2-33B (Tay et al., 2022)), which are pre-trained on WebLI (Chen et al., 2023b) with 1 billion image-text pairs. Both models, which use non-instruction-tuned language models, exhibit minimal zero-shot performance on INFOSEEK. Consequently, we fine-tune both models on INFOSEEK Wikidata to improve their performance.
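For concreteness, a zero-shot No-KB query with BLIP2 might look like the sketch below, using the public HuggingFace checkpoint; the checkpoint name, image URL, and prompt format are illustrative assumptions and not necessarily the exact setup used in our experiments.

```python
# Minimal zero-shot BLIP2 query sketch; not the exact experimental setup.
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

# Hypothetical image of a visual entity (e.g., a specific church).
image = Image.open(requests.get("https://example.com/church.jpg", stream=True).raw)
prompt = "Question: When was this building constructed? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```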

Models with KB Information
In this protocol, we explicitly model the path to answering info-seeking questions with two decoupled sub-tasks: (1) recognizing the visual entity grounded to the KB, and (2) textual reasoning to answer the question. A hidden benefit of such pipeline systems is improved interpretability, because it is easier to locate the source of errors by examining the output of each sub-task component.

• Fusion-in-Decoder (FiD): KB Reader. We experiment with a SOTA retrieval-augmented model, which reads information from a KB, to understand the value of Wikipedia articles in the KB. Specifically, the FiD (Izacard and Grave, 2020) model is employed, which takes N=100 retrieved articles as input and generates an answer. The model is pre-trained with a T5 Large (Raffel et al., 2020) backbone.
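A rough sketch of how a Fusion-in-Decoder style reader consumes multiple retrieved articles is shown below; it uses a generic T5 checkpoint and a hand-rolled fusion step purely for illustration, not the pre-trained FiD reader used in our experiments.

```python
# Schematic Fusion-in-Decoder reading: encode each (question, passage) pair
# independently, concatenate all encoder states, and decode a single answer.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def fid_answer(question: str, passages: list[str]) -> str:
    texts = [f"question: {question} context: {p}" for p in passages]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=256)
    with torch.no_grad():
        enc_out = model.encoder(input_ids=enc.input_ids,
                                attention_mask=enc.attention_mask)
        hidden = enc_out.last_hidden_state                # (N, L, H)
        fused = hidden.reshape(1, -1, hidden.size(-1))    # (1, N*L, H)
        fused_mask = enc.attention_mask.reshape(1, -1)    # (1, N*L)
        out = model.generate(
            encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            attention_mask=fused_mask,
            max_new_tokens=16,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```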
Results for No-KB Models

We found that InstructBLIP (0-shot) performs significantly worse than its initial checkpoint, BLIP2 (7.4 vs. 11.3 on INFOSEEK Wikidata), which contradicts the superior zero-shot performance of InstructBLIP reported in Dai et al. (2023). We conduct a manual analysis and find that a common error made by InstructBLIP is its preference for generating coarse-grained predictions compared to BLIP2 (e.g., "architect" vs. a person's name). This leads to a performance drop on INFOSEEK, which emphasizes fine-grained answers (see Figure 5). We hypothesize that this can be attributed to the instruction-tuning datasets used for InstructBLIP (e.g., VQAv2 and OK-VQA), which share a less fine-grained answer distribution. Fortunately, fine-tuning on INFOSEEK Wikidata helps close the gap.

Results for With-KB Models
Models with KB access perform better. Pipeline systems with KB access answer visual info-seeking questions by effectively utilizing visual recognition and language reasoning, specifically using the names of visual entities to convey information across modalities. When comparing the two language QA models, we observe that the FiD model, which reads the Wikipedia article, achieves the highest generalization performance on INFOSEEK Human by a significant margin. This suggests that access to relevant text content plays a crucial role in answering visual info-seeking questions.
Large headroom for improvement. End-to-end models such as PaLI fall short on fine-grained knowledge-intensive questions (i.e., TIME and NUMERICAL), but perform well on other questions, which are more about querying attributes or resolving relations between entities (see Figure 6). Comparing With-KB models, PaLM and FiD perform on par with each other on the automated evaluation data. However, when evaluated on natural info-seeking human queries, FiD generalizes better, significantly outperforming PaLM on TIME (21.5 vs. 14.6) and NUMERICAL (25.6 vs. 21.3) questions from INFOSEEK Human. One possible reason is that natural info-seeking questions written by people focus more on very fine-grained information, which is rare and hard for PaLM to memorize. In contrast, FiD can leverage Wikipedia articles to predict answers. Finally, we analyze the performance of different models according to visual entity popularity and find unique advantages of end-to-end models (see Appendix).
Performance on Head vs. Tail Entities. Although pipeline models with KB access are overall stronger, we observe that end-to-end models have a unique advantage for info-seeking VQA, particularly on tail entities. Figure 7 presents a comparison of models, with group-wise performance on Wikipedia entities from the least popular (fewer monthly page views) to the most popular (more monthly page views). The histogram is generated based on the average monthly Wikipedia pageviews in 2022, following Mallen et al. (2022). Surprisingly, the results show that PaLI-17B outperforms the pipeline systems by a large margin on tail entities, particularly for questions related to geographical information. We show some qualitative examples in Figure 9, for entities from buckets of different monthly page views. This suggests that there are many different routes to answer visual info-seeking questions, and that pipeline systems relying on an explicit decomposition of the VQA task may be redundant and susceptible to error propagation from the entity-linking stage.
End-to-end models such as PaLI, in contrast, can flexibly decide which route of reasoning is more appropriate to answer a given question. For example, one can answer geographical questions without knowing the identity of the visual entity, if other relevant visual clues are present. Meanwhile, on the more popular head visual entities, a clear trend emerges: pipeline systems outperform end-to-end PaLI by a large margin.
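The popularity analysis can be reproduced schematically as below; the field names and the equal-sized bucketing are assumptions for illustration (the paper buckets by average monthly Wikipedia pageviews in 2022, following Mallen et al., 2022).

```python
# Sketch: group evaluation examples into popularity buckets by the monthly
# Wikipedia pageviews of their entity, then report accuracy per bucket.
from collections import defaultdict

def accuracy_by_popularity(examples, num_buckets=5):
    """examples: dicts with (assumed) keys 'monthly_pageviews' and 'correct'."""
    ranked = sorted(examples, key=lambda ex: ex["monthly_pageviews"])
    bucket_size = max(1, len(ranked) // num_buckets)
    buckets = defaultdict(list)
    for i, ex in enumerate(ranked):
        buckets[min(i // bucket_size, num_buckets - 1)].append(ex["correct"])
    # Bucket 0 holds the least popular (tail) entities, the last bucket the head.
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```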

Related Work
Pre-trained Vision-Language Models. There has been significant growth in the development of vision-language models pre-trained on large-scale image-text datasets (Lu et al., 2022; Bao et al., 2021; Wang et al., 2022; Zhou et al., 2020; Radford et al., 2021). One line of research aims to augment a pre-trained language model with the visual modality by learning a mapping from an external visual encoder to the frozen large language model (Alayrac et al., 2022; Li et al., 2023; Koh et al., 2023), to fully leverage textual knowledge from the language model (Xu et al., 2023; Dai et al., 2023; Liu et al., 2023; Zhu et al., 2023; Ye et al., 2023).
Knowledge-based VQA Models. Various approaches have been proposed to address knowledge-based VQA tasks (Marino et al., 2019) by incorporating external knowledge into vision-language models. One approach is to retrieve information from an external KB (Marino et al., 2021; Hu et al., 2022b; Wu and Mooney, 2022) and employ a model (Izacard and Grave, 2020) to perform language QA (Gui et al., 2022; Lin et al., 2022). Other approaches transform the image into a text caption and use an LLM (Brown et al., 2020; Chowdhery et al., 2022) to answer questions (Yang et al., 2022; Hu et al., 2022a). We utilize both approaches to study the ceiling for improvement on INFOSEEK with the OVEN model (Hu et al., 2023).
Another concurrent work (Mensink et al., 2023) investigates similar challenges but emphasizes scalability and relies on model-generated annotations, as opposed to our human-annotated info-seeking queries.

Conclusion
We introduced INFOSEEK, a large-scale VQA dataset that focuses on answering visual information-seeking questions. With INFOSEEK, we found that current state-of-the-art pre-trained vision-language models struggle to answer visual info-seeking questions requiring fine-grained knowledge, such as questions about the time and numerical information of a visual entity. Our analysis using pipeline systems, which ground the visual entity to an external knowledge base, suggests that incorporating fine-grained knowledge into the pre-training process holds significant potential for improving end-to-end pre-trained models.

Limitation
INFOSEEK is limited to the English language, and future research could expand it to a multilingual setting, leveraging Wikipedia articles in other languages. While the primary focus of this work is on knowledge derived from Wikipedia, future investigations could explore extensions to other domains, such as medical information and artwork, and incorporate emerging updates to Wikipedia (Iv et al., 2022).

A Details of the Dataset
In this section, we provide more details of the human annotation quality control and automatic data generation process.We summarize the statistics of INFOSEEK in Table 7 and show question prefix distribution in Figure 11 and entity distribution in Figure 12.
A.1 Human Annotation Quality Control

Instruction and Training. We hire 30 full-time in-house annotators to collect questions and answers in INFOSEEK Human. Annotators are native English speakers in the U.S. and are aware of the purpose of the collected data. To ensure the quality of annotations in the INFOSEEK Human dataset, a comprehensive training process was designed and implemented for our annotators. This process involved a pilot study, in which annotators read the instructions and annotated a few sample examples, followed by a tutorial session and a quiz. The tutorial was conducted through an online video session and provided a comprehensive overview of the instructions while addressing common mistakes identified in the pilot study. Only annotators who passed the quiz were selected to work on the main task, with 30 annotators completing the training. We pay annotators $17.8 per hour, which is higher than the minimum wage in the U.S., to fairly compensate them for their time and effort. The average completion time for stages one and two of the annotation task was 12 and 10 minutes, respectively. A screenshot of the annotation interface is provided in Figure 13.

Annotation Procedure. Stage 1 (Question Writing): As shown in Figure 13 (top), annotators are shown images of a visual entity on the left-hand side, with a short description of the entity from Wikipedia below. On the right, we show a list of Wikipedia section titles of the entity and ask annotators to write relevant questions next to the section titles. We prevent annotators from asking binary questions, asking about visual attributes (such as color), writing questions by rephrasing the description, copying the entity name or section titles into the question, and writing ambiguous questions.
Stage 2 (Answer Labeling): As shown at the bottom of Figure 13, annotators are presented with the info-seeking questions about the entity collected in Stage 1 and a Wikipedia link for the entity.
For each question, annotators are asked to find a short answer span (less than 10 words) from the Wikipedia page. They are asked to answer two questions: (1) "Can you derive the answer from the given Wikipedia page?" and (2) "What is the type of this question?", selecting from three options (TIME, NUMERICAL, OTHERS). For each answer, they then fill in the answer box (TIME: [year, month, day]; NUMERICAL: [min, max, unit]; OTHERS: [string]) and copy-paste a short sentence from Wikipedia that contains the answer into the evidence section. We decided to exclude questions without answer spans from Wikipedia, following TyDiQA-GoldP, as the dataset is already hard enough, and reserve these questions for future work.
Expert Feedback and Correction. Expert annotators provided regular feedback during annotation and conducted thorough post-annotation verification. The data was split into three batches, and annotators who consistently made similar mistakes were flagged and given feedback. After the completion of Stage 1, questions that revealed the entity name, asked about the color or shape of an object, or were binary were automatically rejected. After Stage 2, three expert annotators reviewed and processed the question-answer pairs, removing unqualified pairs and verifying the answer span against the annotated evidence sentence.
Rejected pairs may have included questions that were not answered by the annotated answer or were too general and resulted in an ambiguous answer.
The expert annotators also corrected the question-type annotation and edited the answer span into the correct format, such as adding units for numerical questions or shortening long answer spans that exceeded ten tokens. Finally, the expert annotators reviewed the image-question-answer triples to reject bad images or clarify the question when multiple objects were present in the image. For example, a building was specified when multiple buildings were present in the image. On average, it took 1.5 hours to verify 1,000 triples, as the majority of images contained a single object. Following TyDiQA (Clark et al., 2020), we analyze the degree to which the annotations are correct instead of the inter-annotator agreement, since a question may have multiple correct answers. In Table 8, human experts carefully judged a sample of 200 examples from the INFOSEEK Human split. For each example, the expert reads through the Wikipedia page of the queried visual entity and finds the answer to the question. They then indicate whether the annotated answer is correct. We take the high accuracy (95%) as evidence that the quality of the dataset offers a valuable and reliable signal for evaluating visual info-seeking models.

A.2 Filtering and Subsampling
Filtering. To test the models' ability to answer visual information-seeking questions that require fine-grained knowledge, which can be learned from a pre-training corpus such as Wikipedia, we need to verify the consistency of answers between Wikidata and Wikipedia. Given that Wikidata and Wikipedia are crowd-sourced independently, some QA pairs created from Wikidata may not be present in the Wikipedia article or may have different answers. Therefore, we filtered out QA pairs where the answer could not be found in the Wikipedia article of the entity. We performed an exact string match to verify answers for STRING questions, and used fuzzy string matching (SequenceMatcher from the difflib library) with a substring ratio greater than 0.9 if an exact match could not be found. For TIME questions, we applied an exact match to verify the year, month, and date. In some cases, the year of construction of a building varied by a year, so we allowed a +/-1 year deviation for TIME questions. For NUMERICAL questions, we used exact matching to verify the numbers in the article. However, in many cases the units were different (meters or inches), or a range with a minimum and maximum was given. We used regular expressions to extract the number or range from the Wikipedia article, and filtered out a QA pair if it was counted as incorrect based on the Relaxed Accuracy described in Section 3.3 of the main text. Based on a manual analysis of 200 randomly sampled QA pairs, we found that 97% of the answers in INFOSEEK Wikidata could be found in the Wikipedia article.
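One possible implementation of this consistency filter is sketched below; the exact matching rules used to build INFOSEEK may differ in detail, and the helper for TIME questions assumes the years have already been extracted from the article.

```python
# Sketch of the Wikidata/Wikipedia answer-consistency filter described above.
from difflib import SequenceMatcher

def string_answer_in_article(answer: str, article: str, threshold: float = 0.9) -> bool:
    answer, article = answer.lower(), article.lower()
    if answer in article:                      # exact (substring) match first
        return True
    # Fuzzy fallback: longest matching block relative to the answer length.
    matcher = SequenceMatcher(None, answer, article, autojunk=False)
    block = matcher.find_longest_match(0, len(answer), 0, len(article))
    return block.size / max(1, len(answer)) >= threshold

def time_answer_in_article(gold_year: int, years_in_article: set[int]) -> bool:
    # Allow a +/-1 year deviation (e.g., construction dates that differ by a year).
    return any(abs(gold_year - y) <= 1 for y in years_in_article)
```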
Subsampling Questions. To achieve a more diverse set of questions in INFOSEEK, we applied a subsampling method to address the skewed distribution of crowd-sourced knowledge triples in Wikidata, following the approach used in Zhong et al. (2022). We define P(r, c) as the percentage of triples that contain the relation r with the subject entity's category c, and P'(r, c) = 1/|(r, c)| as the average (uniform) probability over relation-category pairs. Each image-question-answer triple is then removed with probability

p_remove(r, c) = 1 - min(1, P'(r, c) / P(r, c))^(1/2),

so that over-represented relation-category pairs are down-sampled more aggressively. Additionally, the same subsampling method was applied to balance the answer distribution for each relation. This resulted in the question prior baseline achieving a relatively low score (3.2) on INFOSEEK Wikidata, as shown in Table 4 of the main text.
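Under this reading of the removal probability, the subsampling step can be sketched as follows; the triple fields are assumed for illustration.

```python
# Sketch of relation-category subsampling: over-represented (relation, category)
# pairs are dropped with higher probability.
import random
from collections import Counter

def subsample_triples(triples, seed=0):
    """triples: dicts with (assumed) keys 'relation' and 'category'."""
    rng = random.Random(seed)
    counts = Counter((t["relation"], t["category"]) for t in triples)
    total = sum(counts.values())
    p_uniform = 1.0 / len(counts)                              # P'(r, c)
    kept = []
    for t in triples:
        p_rc = counts[(t["relation"], t["category"])] / total  # P(r, c)
        p_remove = 1.0 - min(1.0, p_uniform / p_rc) ** 0.5
        if rng.random() >= p_remove:
            kept.append(t)
    return kept
```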

A.3 Evaluation Metric.
There are three types of questions, i.e., STRING, TIME, and NUMERICAL, which are evaluated differently. Particularly, we adopt the VQA accuracy (Goyal et al., 2017; Marino et al., 2019) against multiple references for STRING and TIME questions, and utilize Relaxed Accuracy (Methani et al., 2020; Masry et al., 2022) for NUMERICAL questions.

For STRING questions, we use the aliases of answers from Wikidata as multiple references for INFOSEEK Wikidata (#avg = 4.5), and the human-annotated multiple references for INFOSEEK Human (#avg = 2.4).
Exact Match: correct if the prediction matches any one of the references exactly.
• prediction = "USA", references = ["USA", "U.S.", "United States of America", ...] → ✓

For TIME questions, the answer references account for different date formats of year/month/day. Meanwhile, we perform a relaxed match (with a one-year error tolerance) to measure the model's prediction, because historical events are quite often associated only with estimated dates.
Exact Match: correct if the prediction matches any one of the references exactly.

For NUMERICAL questions, we use Relaxed Accuracy: correct if the prediction falls in the reference range, or the prediction range overlaps with the reference range by more than 50%:
1) ref_min ≤ pred ≤ ref_max;
2) IoU([pred_min, pred_max], [ref_min, ref_max]) ≥ 50%
• reference = 10 cm → reference = [9, 11]  # 10% tolerance
• prediction = 10, reference = [9, 11] → ✓
• prediction = [5, 6], reference = [9, 11] → ×
A single-value prediction is counted as correct if it falls within the answer range, and a range prediction is correct if the intersection-over-union between the prediction and the answer range is greater than or equal to 50%. Finally, we calculate the accuracy for each data split (UNSEEN QUESTION and UNSEEN ENTITY), and take the harmonic mean of them as the overall accuracy.
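For reference, the NUMERICAL check and the overall score can be written directly from the rules above; this is a simplified sketch, assuming references have already been expanded into [min, max] ranges.

```python
# Sketch of the NUMERICAL relaxed-accuracy rule and the overall metric.
def relaxed_numerical_match(pred, ref_min, ref_max):
    """pred is a single number or a (pred_min, pred_max) range."""
    if isinstance(pred, (int, float)):
        return ref_min <= pred <= ref_max
    pred_min, pred_max = pred
    inter = max(0.0, min(pred_max, ref_max) - max(pred_min, ref_min))
    union = max(pred_max, ref_max) - min(pred_min, ref_min)
    return union > 0 and inter / union >= 0.5

def overall_accuracy(acc_unseen_question, acc_unseen_entity):
    # Harmonic mean of the two split accuracies.
    return (2 * acc_unseen_question * acc_unseen_entity
            / (acc_unseen_question + acc_unseen_entity))
```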

Figure 1: While 70.8% of OK-VQA questions can be answered by average adults without using a search engine, INFOSEEK poses challenges to query fine-grained information about the visual entity (e.g., Dominus Flevit Church), resulting in a sharp drop to 4.4% ( §2).

Figure 2: Visual info-seeking models under the proposed No KB and With KB protocols. (a) End-to-end VQA models (such as PaLI (Chen et al., 2023b) or BLIP2 (Li et al., 2023)) that directly predict the answer from looking at the image and question; (b) pipeline systems with access to a knowledge base (e.g., Wikipedia), which link the queried subject to Wikipedia using CLIP (Radford et al., 2021) and perform textual question answering using PaLM (Chowdhery et al., 2022) or Fusion-in-Decoder (FiD) (Izacard and Grave, 2020).

Figure 3: Zero-shot & fine-tuned performances on INFOSEEK. Fine-tuning on INFOSEEK elicits knowledge from PaLI models to answer fine-grained visual info-seeking questions.

Figure 5: InstructBLIP (0-shot) makes less fine-grained predictions compared to its initial model (BLIP2), after instruction tuning on prior VQA datasets.

Figure 8: Random examples from the training set of INFOSEEK Wikidata .

Figure 10: Examples of predictions of BLIP2 (0-shot) and BLIP2 fine-tuned on INFOSEEK (Entity: Q457057). Fine-tuning improves the accuracy from 2/11 to 10/11, despite it being an UNSEEN ENTITY (not in the training set). We show training-set entities that are located in Armenia and images of Amberd on the internet.

Table 1: Comparison of INFOSEEK and prior KI-VQA benchmarks. Performances reported in VQA score.

Table 3: Two evaluation protocols of INFOSEEK. The key difference is whether auxiliary data for visual entity recognition and a knowledge base is available at training. I: image, Q: question, A: answer, E: queried visual entity.

We introduce two protocols, No KB and With KB, to evaluate models with different information accessible from INFOSEEK. Table 3 and Figure 2 provide a comparison of the two setups. This key design choice is made to encourage models from different families to be compared with a clear notion of what information was accessed. We note that the No KB protocol is more challenging than the With KB protocol.
The No-KB protocol. Models are tasked to directly predict the answer by examining the image and question, similar to traditional VQA systems. This requires the model to store world knowledge in its parameters for effective question answering. The research question focuses on how much knowledge an end-to-end model can memorize in its parameters during pre-training, and how well it can utilize this knowledge after fine-tuning. We use the standard VQA-formatted data, i.e., {Image (I), Question (Q), Answer (A)} triplets for training / validation, and {I, Q} for testing.

The With-KB protocol. The goal is to analyze the headroom for improvement when a viable reasoning chain is explicitly provided. Therefore, this protocol encourages an extra step of visual entity recognition, grounding the task on a knowledge base. The VQA task is transformed into a two-step pipeline, i.e., (1) visual entity recognition, and (2) language QA with entity information. We provide training signals to first recognize the queried visual entity and then leverage this information to query a large language model for answers, or to identify relevant Wikipedia articles for extracting the answer. Specifically, we provide a 100K-entry Wikipedia KB (articles and infobox images) that includes visual entities from INFOSEEK and the most frequent entities from Wikipedia. During training and validation, the With-KB protocol provides entity labels for each queried visual entity. During testing, the model is evaluated based on the {I, Q} pairs only.

4.1 Models without KB Information

Random & Prior. Random answers sampled from the training set; the majority answer based on the question prior, which is calculated using the training-set questions grouped by question 4-gram.
Sub-task #1: Visual Entity Recognition. We follow the entity recognition task defined in OVEN (Hu et al., 2023): the model takes an image and a text query (e.g., "What is this building?") as inputs, and predicts an entity among 100K multi-modal Wikipedia entries. Particularly, we employ the pre-trained CLIP (Radford et al., 2021) model (ViT-L/14) as our visual entity recognition model, because of its strong generalization capability. Specifically, we follow the CLIP2CLIP model described in Hu et al. (2023) to fine-tune CLIP to encode multi-modal representations (image, question) from our dataset as the query, and (Wikipedia image, Wikipedia title) from the KB as candidates. We then retrieve the top k=5 most similar entities based on weighted cosine similarity scores computed between the query and candidates. The recognized entity is then passed to the language QA step via a prompt of the form "question: This is {entity} {question} answer:".
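A schematic version of the retrieval step, assuming pre-computed CLIP embeddings for the query (image + question) and every KB entry (Wikipedia image + title); the similarity weights here are placeholders rather than the learned CLIP2CLIP weights.

```python
# Sketch of weighted cross-modal retrieval over a 100K-entry Wikipedia KB.
import numpy as np

def retrieve_top_entities(q_img, q_txt, kb_img, kb_txt, kb_ids, k=5,
                          weights=(0.25, 0.25, 0.25, 0.25)):
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q_img, q_txt = norm(q_img), norm(q_txt)        # query image / question embeddings
    kb_img, kb_txt = norm(kb_img), norm(kb_txt)    # KB infobox image / title embeddings
    # Weighted sum of cosine similarities across all modality pairings.
    scores = (weights[0] * kb_img @ q_img + weights[1] * kb_txt @ q_txt +
              weights[2] * kb_img @ q_txt + weights[3] * kb_txt @ q_img)
    top = np.argsort(-scores)[:k]
    return [kb_ids[i] for i in top]
```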

Table 5: Results of the With-KB setting.

Table 6: Results w.r.t. each question type on the INFOSEEK Wikidata val set of the unseen question split, showing a big headroom for improvement on TIME and NUMERICAL for all end-to-end models.

Table 7: INFOSEEK dataset statistics. The average number of questions per image is 1.4 and 1.0 for the Wikidata and Human splits, respectively.

Table 8: Expert judgments of answer accuracy based on a sample of 200 examples from INFOSEEK Human.