A Simple Baseline for Knowledge-Based Visual Question Answering

This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA


Introduction
Knowledge-based VQA (KB-VQA) is a recently introduced VQA task (Wang et al., 2017, 2018; Marino et al., 2019; Shah et al., 2019) where the image alone is not sufficient to answer the given question; effective utilization of external knowledge resources is additionally required. To solve such a task, a model needs not only strong visual perception but also reasoning capabilities, while also being able to effectively incorporate world knowledge from external KBs (e.g. Wikipedia) and LLMs. Systems capable of answering general and diverse questions about the visual world find a wide range of applications: from personal assistants to aids for the visually impaired and robotics.
Recently, several works on KB-VQA (Gui et al., 2022; Lin et al., 2022) have emphasized the significance of incorporating both explicit and implicit knowledge. However, such approaches usually require complicated pipelines. Firstly, a KB (e.g. Wikidata) covering world knowledge needs to be maintained and used for knowledge retrieval, which is time-consuming and very sensitive to noise. Secondly, powerful LLMs such as GPT-3 (Brown et al., 2020) or OPT-175B (Zhang et al., 2022) are leveraged due to the huge amount of implicit knowledge stored in their parameters and their powerful reasoning capabilities through few-shot in-context learning. However, the computational or even monetary cost (e.g. for API access) associated with accessing such models renders them unaffordable for many researchers. Thirdly, it is crucial to train a fusion mechanism that can effectively reason by combining the retrieved explicit and implicit knowledge.

Main contributions:
We present a simple yet powerful pipeline for KB-VQA which bypasses the need for most of the components of the above-mentioned systems. Specifically, the proposed system is simply based on few-shot prompting of LLaMA-13B (Touvron et al., 2023a,b). The key component of our method is effective in-context learning using question-informative captions as contextual information which, as we show, results in large accuracy boosts.
The proposed system features several advantages: (1) it is entirely training-free, requiring only a few examples for in-context learning; (2) it is based on the open-source LLaMA-13B (Touvron et al., 2023a,b), considerably smaller than the widely used GPT-3; (3) it is straightforward to reproduce; and (4) it achieves state-of-the-art (SOTA) accuracy on the widely used OK-VQA (Marino et al., 2019) and A-OK-VQA (Schwenk et al., 2022) datasets.
Related Work on KB-VQA
Methods without LLMs: Several methods have been proposed, including KRISP (Marino et al., 2021), which uses a multi-modal pretrained BERT (Devlin et al., 2019); MAVEx (Wu et al., 2022), which proposes to validate promising answer candidates based on answer-specific knowledge retrieval; and DPR, which uses pseudo-relevance labels integrated with answer generation for end-to-end training. Typically, these systems are not as competitive as the ones based on LLMs.
Methods based on LLMs: PICa (Yang et al., 2022) is the first method to adopt GPT-3 for solving the KB-VQA task in a few-shot manner by simply providing a few in-context VQA examples. Gui et al. (2022) proposed to use both implicit (i.e. GPT-3) and explicit (i.e. KBs) knowledge based on CLIP retrieval (Radford et al., 2021), combined by a novel fusion module called KAT (based on T5 or BART). Lin et al. (2022) proposed to integrate local visual features, positional information (bounding box coordinates), and retrieved external and implicit knowledge (using GPT-3) into a transformer-based question-answering model. Hu et al. (2023) proposed PromptCap, a novel task-aware captioning model that uses a natural language prompt to control the generation of the visual content, used in conjunction with GPT-3 in-context learning. Img2Prompt (Guo et al., 2023) is a zero-shot VQA method that generates image-relevant exemplar prompts for the LLM; its key insight is that synthetic question-answer pairs can be generated using image captioning and question-generation techniques as in-context exemplars from the provided image. Prophet (Shao et al., 2023) proposes to prompt GPT-3 with answer heuristics (answer candidates and answer-aware examples) that are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity.

Methodology
While explicit knowledge retrieval focuses on semantic matching between an image and knowledge entries, it lacks implicit commonsense knowledge (e.g. lemons are sour) which can be found in LLMs (Gui et al., 2022). LLMs are critical in extracting implicit knowledge due to the vast amount of implicit information embedded in their parameters and their powerful reasoning capacity through few-shot in-context learning. Different from previous work (Yang et al., 2022; Gui et al., 2022; Lin et al., 2022), we leverage the open-source LLM LLaMA-13B (Touvron et al., 2023a,b) instead of GPT-3 as an implicit language knowledge base and treat VQA as an open-ended text generation task.
Our method builds upon the pipeline of PICa, the pioneering work that utilizes GPT-3 for few-shot in-context learning in order to address the KB-VQA task. GPT-3 is a decoder-only autoregressive LLM with 175B parameters, trained on a diverse range of data sources, including Common Crawl, webtexts, books, and Wikipedia (Brown et al., 2020). During inference, in-context few-shot learning involves formulating a novel downstream task as a text sequence generation task using the frozen GPT-3 model. When provided with a testing input x, the target y is predicted based on a formatted prompt p(h, C, E, c, x). In this prompt, h represents a prompt head or instruction that describes the task, while E = {e_1, e_2, ..., e_n} represents a set of n in-context examples (shots), where e_i = (x_i, y_i) is an input-target pair of the task. These pairs are constructed manually or sampled from the training set. C = {c_1, c_2, ..., c_n} represents a set of generic image captions describing each x_i, since images cannot be inputted to GPT-3; the caption of the test input is denoted as c. The target y is a text sequence of L tokens, y = (y_1, y_2, ..., y_L). At each decoding step t, the next token is generated by the frozen LLM conditioned on the prompt and the previously generated tokens:

$y_t = \operatorname*{arg\,max}_{y_t} \, p_{\text{LLM}}(y_t \mid p, y_{<t})$

In order to utilize any LLM for the knowledge-based VQA task, the crucial step is to design suitable prompts. When given a question q_i and an image v_i as inputs, the VQA task's objective is to predict the corresponding answer a_i. However, since LLMs do not inherently comprehend images, it becomes necessary to convert the image into a caption c_i using a pre-existing captioning model. While SOTA pretrained captioning models have demonstrated impressive performance, they are primarily optimized to generate generic image captions. Unfortunately, these captions often fail to capture all the specific details required to accurately answer a given question about the image. In
this work, instead of generic captions, we generate question-guided informative image captions using the Plug-and-Play VQA (PNPVQA) framework (Tiong et al., 2022), which identifies the image patches most related to the question via a saliency map-based interpretability technique and generates captions from these patches only. For each image-question pair, we first generate 50 question-guided informative image captions from the image v_i using PNPVQA. We then employ BLIP's (Li et al., 2022) text encoder to encode all the image captions and BLIP's image encoder to encode the image v_i. We rank the image captions according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example. After extracting the top-m most similar captions per image v_i, we construct a carefully designed text prompt consisting of a general instruction sentence, the captions C, the question, the test input's captions c, and a set of context-question-answer triplets (shots) taken from the training dataset that are semantically most similar to the current image-question pair (see Fig. 1). This text prompt is then passed to a frozen LLaMA-13B model, and in-context few-shot learning is performed in order to obtain its output as a promising answer candidate for the current image-question pair.
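The caption-ranking step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings would in practice come from BLIP's text and image encoders, but here they are plain vectors, and the function names are our own:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_m_captions(image_emb, captions, caption_embs, m):
    """Rank candidate captions by cosine similarity between their text
    embedding and the image embedding, and keep the m most similar ones
    (most relevant first). In the paper the 50 candidates per image come
    from PNPVQA and the embeddings from BLIP."""
    ranked = sorted(zip(captions, caption_embs),
                    key=lambda pair: cosine(image_emb, pair[1]),
                    reverse=True)
    return [caption for caption, _ in ranked[:m]]
```

For example, with an image embedding close to the "bus" captions, the sky caption would be ranked last and dropped for m = 2.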

Selecting Informative Examples for Few-Shot In-Context Learning

Experimental Results
Comparative results on OK-VQA: Table 1 summarizes the results of various methods on OK-VQA, including our best method (last row), which uses 9 question-informative captions and 5 query ensembles. When using LLaMA, our approach outperforms all other methods and achieves results comparable to Prophet, especially when using the same shot selection strategy based on MCAN (Yu et al., 2019). Moreover, it performs better than Unified-IO and the 80B Flamingo, which have been pre-trained with multimodal objectives. When compared to methods that rely on GPT-3 for implicit knowledge extraction, our approach outperforms PICa-Full, which only uses generic image captions, by 12.02%, while outperforming the SOTA supervised methods KAT and REVIVE by 5.61% and 2.02%, respectively. Finally, when using LLaMA 2 and the MCAN-based shot selection strategy, our method achieves a state-of-the-art accuracy of 61.2%.
Comparative results on A-OK-VQA: Table 2 summarizes the results of various methods on A-OK-VQA, including our best method (last row), which uses 9 question-informative captions and 5 query ensembles. We compare our method to the strong baselines in Schwenk et al. (2022) and the current state-of-the-art method Prophet (Shao et al., 2023). When employing LLaMA, our approach surpasses all other methods in the direct answer (DA) setting and achieves results comparable to Prophet, particularly when employing the same MCAN-based shot selection strategy. Finally, with LLaMA 2 and MCAN, our method attains state-of-the-art performance on both the validation and test sets, achieving 58.6% and 57.5% accuracy respectively, demonstrating the effectiveness and robust generalization of our proposed method.

Ablation Studies
We conduct several ablations on OK-VQA to better understand the key components of our method.
Effect of question-informative captions: Table 3 shows the performance of our method when using generic vs question-informative captions for in-context learning, which is the key component of our system. Following Yang et al. (2022) and Shao et al. (2023), we leverage OSCAR+ (Zhang et al., 2021) as the captioning model for the generic captions.
Effect of shot selection strategy: Table 4 shows that selecting random shots during in-context learning hurts accuracy, confirming the findings of Yang et al. (2022). Retrieving shots based on the similarity between the test sample and the training examples yields a significant accuracy boost. Prophet's shot selection strategy based on MCAN also seems effective, but we note that it requires pre-training a vanilla VQA model on a different dataset (VQA-v2).
Effect of number of question-informative captions: Fig. 2 (a) shows the accuracy as we increase the number of captions per sample in the prompt during in-context learning. Here, we use k = 5, with n = 10 when using 1-10 captions and n = 5 when using more than 10 captions due to max. sequence length constraints. More captions provide more information for each example, helping the model make a more accurate prediction based on context. As shown in the figure, the validation accuracy keeps increasing up to 60.02%. When using more than 10 captions, the accuracy decreases, but this can also be attributed to the fact that we are simultaneously decreasing the number of shots n.
Effect of explicit knowledge: We also tried to use KAT's (Gui et al., 2022) KB and trained a T5 (Raffel et al., 2020) in order to integrate explicit knowledge into our model. For each image, we used BLIP to extract explicit knowledge via image-to-text retrieval. We used 40 retrieved passages and LLaMA predictions as explicit and implicit knowledge, respectively. We achieved an accuracy of 58.70%, which shows that our model does not benefit from such an approach.
Effect of size of LLM: We also used a LLaMA-7B model with 9 question-informative captions, n = 10, and k = 5. Reducing the size of the LLM leads to decreased accuracy, but the drop is not large: the model still obtains 57.99% accuracy.

Conclusions
We proposed a simple yet effective baseline for KB-VQA. Our training-free method is based on in-context few-shot learning of the open-source LLaMA using question-informative captions. We show that this is sufficient to achieve SOTA results on the widely used OK-VQA and A-OK-VQA datasets.

Limitations
It is important to acknowledge that we have not explored the utilization of any other medium-sized LLMs apart from LLaMA, which presents a limitation of our study. Moreover, due to resource limitations, we were unable to conduct experiments with model sizes beyond 13B. It would indeed be intriguing to observe the performance when employing LLaMA models of sizes such as 30B or 65B.

Figure 1 :
Figure 1: Inference-time view of our method for n-shot VQA. The input prompt to LLaMA consists of a prompt head h (blue box), n in-context examples {c_i, x_i, y_i}, i = 1, ..., n (red boxes), and the VQA input {c, x} (green box). The answer y is produced in an open-ended text generation manner. In this example we use two question-informative captions per example (separated by commas).
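This prompt layout can be sketched in a few lines; note that the field labels "Context:", "Question:", and "Answer:" follow the PICa-style convention and are an assumption here, not necessarily the verbatim template used in the paper:

```python
def build_prompt(instruction, shots, test_captions, test_question):
    """Assemble an n-shot VQA prompt: a prompt head h, then n in-context
    examples (captions c_i, question x_i, answer y_i), then the test
    input's captions c and question x, ending with an open "Answer:" slot
    for the LLM to complete."""
    parts = [instruction]
    for shot in shots:
        parts.append("Context: " + ", ".join(shot["captions"]))
        parts.append("Question: " + shot["question"])
        parts.append("Answer: " + shot["answer"])
    parts.append("Context: " + ", ".join(test_captions))
    parts.append("Question: " + test_question)
    parts.append("Answer:")
    return "\n".join(parts)
```

The returned string is fed directly to the frozen LLM, whose greedy continuation after the final "Answer:" is taken as the answer candidate.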
In-context example selection tries to search for the best examples for each inference-time input x among all available examples (Yang et al., 2022). We consider in-context examples that have similar question features to x. More specifically, given an inference-time question, we use BLIP's text encoder to obtain its textual feature and compute its cosine similarity with the questions in all available in-context examples. We then average the question text similarity with the image visual similarity to guide the example selection, similarly to Yang et al. (2022). We select the top-n questions with the highest similarity and use the corresponding examples as the in-context examples.
Multi-query ensemble: Given an inference-time example x, we use k × n in-context examples to generate k prompts. This way, we prompt LLaMA-13B k times and obtain k answer predictions instead of 1, similarly to Yang et al. (2022), where k is the number of queries to ensemble. Finally, among the k answer predictions, we select the one with the most occurrences (majority vote).
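The two mechanisms above can be sketched as follows; this is an illustrative simplification in which the dictionary keys `q_emb` and `img_emb` are hypothetical names for the BLIP question and image features:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_shots(test_q_emb, test_img_emb, train_examples, n):
    """Score each training example by the average of its question-text and
    image similarities to the test input, and keep the top-n as shots."""
    def score(ex):
        return 0.5 * (cosine(test_q_emb, ex["q_emb"]) +
                      cosine(test_img_emb, ex["img_emb"]))
    return sorted(train_examples, key=score, reverse=True)[:n]

def majority_vote(answers):
    """Among the k ensemble predictions, return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]
```

For the multi-query ensemble, `select_shots` would be called with k × n and the result split into k disjoint prompts, whose k predictions are then reduced with `majority_vote`.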

Figure 2 :
Figure 2: (a) Accuracy vs number of question-informative captions used per shot during few-shot in-context learning. (b) Accuracy vs number of prompts k used during in-context learning.

Table 1 :
Comparison with other methods on the OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance.
The limited maximum sequence length of the LLM restricts the number of examples n in the prompt. To better use the available examples we: (i) improve the example quality by careful in-context example selection (Liu et al., 2022; Gui et al., 2022; Shao et al., 2023), and (ii) use more examples via multi-query ensemble.

Table 2 :
Comparison with other methods on the A-OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance in the direct answer (DA) setting. Note that our method does not support multiple-choice (MC).

Table 3 :
Accuracy when using generic vs question-informative captions, with OSCAR+ as the captioning model: question-informative captions result in huge accuracy boosts (43.35% vs 57.56%).

Table 4 :
Accuracy when using different shot selection strategies. The avg. question and image sim. strategy retrieves shots based on the average cosine similarity between the test sample's question and image and those of the training examples. The MCAN latent space strategy retrieves shots that are closest to the test sample in the trained MCAN's latent space.