Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa


Introduction
Recent years have witnessed unprecedented performance gains on many natural language reasoning tasks, especially in zero-shot and few-shot settings, driven by scaling up pretrained language models (PLMs) and their training data (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2020; Black et al., 2022; Sanh et al., 2022; Wei et al., 2021). Inspired by their success, a natural thought is that utilizing PLMs should also boost zero-shot performance in vision-language reasoning tasks.
However, to leverage PLMs for vision-language tasks, most existing methods require non-trivial adaptation of the PLMs for the vision modality, which necessitates the design of new network components and training objectives. For example, Sung et al. (2022) and Alayrac et al. (2022) insert into the PLMs new layers that are trained from scratch. Tsimpoukelli et al. (2021) train vision encoders that output soft prompts to frozen PLMs. Chen et al. (2022) and Eichenberg et al. (2021) train both the vision encoders and new layers inserted into PLMs. In the zero-shot setting, various vision-language pretraining objectives are employed, such as image captioning (Alayrac et al., 2022) and image-conditioned masked language modeling (Jin et al., 2022).
From the perspective of general-purpose AI, the ability to perform new tasks by simply recombining large-scale pretrained models, or foundation models (Bommasani et al., 2021), without architectural changes or extra training would be highly desirable. Such a system would be able to dynamically adjust to previously unknown tasks by simply rewiring a small number of foundation models. However, to obtain high performance without some form of end-to-end training would seem difficult, if not impossible.
We present Plug-and-Play VQA (PNP-VQA), a framework for zero-shot visual question answering which conjoins large pretrained models with zero additional training and achieves state-of-the-art performance on zero-shot VQAv2 (Goyal et al., 2017) and GQA (Hudson and Manning, 2019). For the purpose of bridging the vision and language modalities, we employ a pretrained vision-language model (PVLM) (Li et al., 2022b) that describes visual information with textual captions. In order to obtain relevant and informative captions, we apply a network interpretability technique (Selvaraju et al., 2017) to detect image patches that are relevant to the question. After that, we generate captions stochastically for these image patches. Finally, we employ a PLM (Khashabi et al., 2022) to answer the question from the captions.
Research in cognitive science and neuroscience suggests that the human cognitive system is largely modular (Shettleworth, 2012; Bertolero et al., 2015). For instance, the pioneering work of Fodor (1983) argued that low-level human cognition is composed of several fast, autonomous, and domain-specific modules. For purely practical purposes, a modular design of artificial general intelligence would make it easy to harness rapid progress in each individual component, as the components can be individually replaced and updated without affecting other parts of the system. With this paper, we offer such a modular design for zero-shot VQA that leverages recent advances in PLMs and PVLMs and combines them with an innovative application of network interpretability.
We summarize our contributions as follows:
• We introduce PNP-VQA, a modular framework for zero-shot VQA without training. Its flexibility allows PNP-VQA to jointly evolve as pretrained models continue to advance.
• Besides natural language, we propose the use of network interpretation as the interface between pretrained LMs and VLMs. With an interpretability technique, we create image captions that extensively cover information relevant to the question, which enable accurate QA.

Related Work

Adapting PLMs for zero-shot VQA has shown promising results. In order to incorporate vision information into PLMs, most existing methods perform additional vision-language training on image-text data. Frozen (Tsimpoukelli et al., 2021) trains the vision encoder while keeping the gigantic PLM frozen to retain its knowledge in question answering. The output from the vision encoder is prepended to the text as prompts to the frozen language model. FewVLM (Jin et al., 2022) finetunes the PLM using the prefix language modeling and masked language modeling objectives. VLKD (Dai et al., 2022) distills multimodal knowledge into the PLM by using CLIP (Radford et al., 2021) as the teacher model during finetuning. Flamingo (Alayrac et al., 2022) adds additional layers to both the pretrained vision model and the PLM and trains the new layers on billions of image-text pairs. Different from the above work, PNP-VQA directly employs pretrained models with neither architectural modifications nor additional training.
Most similar to our work, PICa (Yang et al., 2022) converts an image to a single caption and adopts GPT-3 (Brown et al., 2020) for zero-shot VQA. In comparison, PNP-VQA generates multiple question-guided captions and fuses the captions after encoding to effectively utilize a large number of captions, yielding considerable performance gains.
An orthogonal research direction for zero-shot VQA is to train the VLMs on synthetic VQA examples generated from captions (Changpinyo et al., 2022; Banerjee et al., 2021). PNP-VQA does not require additional training.
Natural language as an intermediate representation or interface between different models or multiple steps of reasoning is an emerging machine learning strategy. It dates back to at least Andreas et al. (2018) and saw renewed interest in the past few months due to the prevalence of large PLMs. Andreas et al. (2018) and Vong and Lake (2022) learn natural language descriptions that function as few-shot classifiers within an image-text matching model. Bostrom et al. (2022) and Wu et al. (2022) chain PLM outputs and inputs. Zeng et al. (2022a) show that language-conjoined LM and VLM successfully perform captioning and retrieval but do not evaluate their models on VQA. In comparison, PNP-VQA adopts both natural language and network interpretation as the interface between different pretrained models.

Method
The central idea of Plug-and-Play VQA (PNP-VQA) is to establish an interface between a pretrained language model and a pretrained vision-language model without training. We demonstrate that natural language image captions and network saliency maps together serve as an effective interface. Ideally, the generated captions should thoroughly cover information that is present in the image and be relevant to the question. We foster relevance by identifying image patches most related to the question with a saliency map-based interpretability technique and generating captions from these patches only. Further, we promote coverage by injecting stochasticity, including random sampling of relevant image patches and of the textual tokens during caption generation.
The overall system architecture (Figure 1) consists of three modules:
1. an image-question matching module that identifies the relevant image patches given a question,
2. an image captioning module that generates a diverse set of captions from a set of image patches, and
3. a question answering module that outputs an answer given the question and the generated captions.
In this section, we introduce the three modules in detail.

Matching Image Patches and Questions
An image serves as a rich source of information, but the question at hand is likely focused only on particular objects or regions. Therefore, we encourage PNP-VQA to generate captions that describe image regions relevant to the question instead of generic captions with no specific aim.
We accomplish this goal by leveraging BLIP (Li et al., 2022b), a large-scale pretrained vision-language model that contains a network branch outputting a similarity score sim(v, t) between an image v and a text t. This branch, called Image-grounded Text Encoder (ITE), employs a vision transformer (Dosovitskiy et al., 2021) that encodes the image, and a textual encoder that attends to the image features using cross-attention. As input to the image encoder, the image is equally divided into K patches.
To identify relevant image patches, we feed the image v and the question t to the ITE network and apply a variation of GradCAM (Selvaraju et al., 2017), a feature-attribution interpretability technique that aggregates all cross-attention maps using weights from the gradients. Formally, let us denote image patch features as X ∈ R^{K×Dv}, where K is the number of image patches and Dv the image feature dimension. We denote textual features as Y ∈ R^{M×Dt}, where M is the number of textual tokens and Dt the text feature dimension. For every cross-attention head, we have parameter matrices W_Q ∈ R^{Dt×Dt} and W_K ∈ R^{Dv×Dt}. The cross-attention scores, A ∈ R^{M×K}, can be written as

A = softmax((Y W_Q)(X W_K)^T / √Dt).   (1)

The j-th row of A indicates the amount of attention the j-th textual token allocates to all image patches. At a selected layer of the ITE network, we compute the derivative of the similarity score w.r.t. the cross-attention scores, ∂sim(v, t)/∂A, and multiply the gradient matrix element-wise with the cross-attention scores. The relevance of the i-th image patch, rel(i), takes the average over H attention heads and the sum over M textual tokens:

rel(i) = (1/H) Σ_{h=1}^{H} Σ_{j=1}^{M} (∂sim(v, t)/∂A^(h) ⊙ A^(h))_{ji},   (2)

where the superscript (h) denotes the index of attention heads. For every caption we generate, we sample a subset of K′ image patches with probability proportional to the patch relevance. The captioning module sees the sampled patches only.
We provide the following motivation for this technique. The attention matrix A may be taken as indicative of patch importance. However, much redundancy exists among the attention maps, and many attention heads may be pruned with little performance loss (Bian et al., 2021), suggesting that some attention scores are uninformative. Inspired by GradCAM, we filter out uninformative attention scores by multiplying them with the gradient, which emphasizes scores whose increase would raise the image-text similarity.
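The relevance computation and patch sampling can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation: the tensor layout, the clamping of negative gradients, and the function names are assumptions.

```python
import torch

def patch_relevance(cross_attn: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """GradCAM-style relevance per image patch (Eq. 2).

    cross_attn: (H, M, K) cross-attention scores A at the selected ITE layer
    grads:      (H, M, K) gradient of sim(v, t) w.r.t. the cross-attention scores
    returns:    (K,) relevance rel(i) for each image patch
    """
    # Element-wise product of attention and gradient; keeping only positive
    # gradient contributions is a common GradCAM choice and an assumption here.
    weighted = cross_attn * grads.clamp(min=0)
    return weighted.mean(dim=0).sum(dim=0)  # average over heads, sum over tokens

def sample_patches(rel: torch.Tensor, k_prime: int = 20) -> torch.Tensor:
    """Sample K' patch indices with probability proportional to relevance."""
    probs = rel / rel.sum()
    return torch.multinomial(probs, num_samples=k_prime, replacement=False)
```

A fresh subset of patches is drawn for each caption, which is what injects stochasticity into the captioning step described next.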
Figure 2 shows some examples of generic captions and question-guided captions with associated relevance heatmaps. We can clearly observe that question-guided captions contain more relevant information that helps produce the correct answers. Table 1 gives a quantitative analysis of the effect of different patch selection methods on zero-shot VQA performance across three datasets. Question-guided patch sampling substantially outperforms generic captioning using all patches and random patch sampling, especially when the number of captions is large. 100 question-guided captions outperform the 5 human-written captions from MS COCO by 5.2% on VQAv2 and 6.0% on OK-VQA, demonstrating the merit of the proposed approach.

Informative Image Captioning
Even with relevant image regions, there may still be more than one way to describe these regions. Some descriptions may contain the desired answer to the question, whereas others may not. Without the ability to identify the answer a priori, we aim to generate maximally diverse captions to provide coverage of possible answers.
We adopt the image captioning network branch from BLIP (Li et al., 2022b) and apply stochastic top-k sampling (Fan et al., 2018) instead of beam search, which is known to produce dull and repetitive captions (Vijayakumar et al., 2018; Holtzman et al., 2020). The input to the network contains the K′ image patches sampled according to relevance (see §3.1). We prepend a short prompt, "a picture of ", as input to the text decoder. We repeat this process to generate N captions per image to encourage diversity of captions and coverage of visual content. To prevent repetition, we keep a generated caption only if it is not subsumed by any previous caption as an exact substring.
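A sketch of this generation loop is shown below, assuming a captioning model and tokenizer with a HuggingFace-style generate interface; how the sampled patch features are passed to the decoder depends on the specific captioner and is only indicated schematically.

```python
def generate_diverse_captions(caption_model, tokenizer, sample_patch_feats,
                              num_captions=100, top_k=50, max_len=40):
    """Generate stochastic captions with top-k sampling and substring de-duplication.

    sample_patch_feats: callable returning visual features of a freshly sampled
    subset of K' relevant patches (see the patch sampling step above).
    """
    prompt = tokenizer("a picture of ", return_tensors="pt").input_ids
    captions = []
    for _ in range(num_captions):
        out = caption_model.generate(
            input_ids=prompt,
            encoder_hidden_states=sample_patch_feats(),  # schematic: model-specific
            do_sample=True, top_k=top_k, max_length=max_len,
        )
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        # Keep the caption only if no earlier caption already subsumes it verbatim.
        if not any(text in prev for prev in captions):
            captions.append(text)
    return captions
```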

Answering the Question
The question-answering encoder-decoder model is pretrained on text data only and can only process text. Therefore, we include the question and the generated captions as input to the model. As discussed in §3.2, the image captioning module generates multiple diverse captions. To process such long inputs efficiently, we adopt the Fusion-in-Decoder (FiD) strategy (Izacard and Grave, 2021).
We illustrate the FiD strategy in Figure 3 by comparing it with the more straightforward Fusion-in-Encoder (FiE), which concatenates the question and all captions into a long paragraph as input to the encoder. In contrast, FiD encodes each caption with the question separately and concatenates the encoded representations of all tokens from all captions. The result is fed as input to the decoder and is processed through the cross-attention mechanism. Since the time complexity of the self-attention mechanism scales quadratically with input length, whereas the cross-attention scales linearly with the encoder's output length, FiD is much more efficient than FiE. Further, FiE is constrained by the maximum input length of the encoder, caused by the positional encoding, but FiD does not have this constraint. Hence, with FiD, PNP-VQA can benefit from even more captions.
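A minimal FiD sketch with a T5-style encoder-decoder from the transformers library is shown below. Passing precomputed encoder outputs to generate in this way is an assumption that may vary across library versions; the reference FiD implementation wraps the model class instead, and the "question \n context" input format follows UnifiedQA conventions.

```python
import torch
from transformers.modeling_outputs import BaseModelOutput

def fid_answer(model, tokenizer, question, captions, group_size=1):
    """Encode each (question, caption group) pair separately, concatenate the
    encoder outputs along the sequence axis, and decode a single answer."""
    groups = [captions[i:i + group_size] for i in range(0, len(captions), group_size)]
    inputs = [question + " \n " + " ".join(g) for g in groups]
    enc = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        hidden = model.get_encoder()(**enc).last_hidden_state   # (N, L, D)
    fused = hidden.reshape(1, -1, hidden.size(-1))               # (1, N*L, D)
    mask = enc.attention_mask.reshape(1, -1)                     # (1, N*L)

    out = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=fused),
        attention_mask=mask, max_length=10,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Setting group_size=1 corresponds to encoding the question with one caption at a time; larger groups pack several captions into each encoded passage.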
We plot the performance of FiD and FiE against the number of captions in Figure 4. Initially, both methods improve as the number of captions increases. However, the performance of FiE is capped at around 40 captions when the maximum input length is exceeded, whereas the performance of FiD continues to rise.

Datasets and Evaluation
We adopt multiple zero-shot VQA benchmarks, including the validation set (214,354 questions) and test-dev set (107,394 questions) of VQAv2 (Goyal et al., 2017), the test set (5,046 questions) of OK-VQA (Marino et al., 2019), and the test-dev set (12,578 questions) of GQA-balanced (Hudson and Manning, 2019). We include the VQAv2 validation set because a few recent works (Tsimpoukelli et al., 2021; Jin et al., 2022) evaluate their performance on this dataset only. We obtain the answer by open-ended generation and perform evaluation based on exact matching. We report soft accuracy (Goyal et al., 2017) for VQAv2 and OK-VQA to account for multiple ground-truth answers; for GQA, we report the standard accuracy.
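For reference, the VQAv2 soft-accuracy credit for a predicted answer is min(#annotators giving that answer / 3, 1). The sketch below is a simplified version that omits the official evaluator's answer normalization and averaging over annotator subsets.

```python
def vqa_soft_accuracy(prediction: str, gt_answers: list[str]) -> float:
    """Simplified VQAv2 soft accuracy: min(# matching human answers / 3, 1).

    The official evaluator additionally normalizes answers (articles,
    punctuation, number words) and averages over annotator subsets; both
    are omitted here for brevity.
    """
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in gt_answers)
    return min(matches / 3.0, 1.0)
```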

Implementation Details
To obtain the image-question matching module and image captioning module, we adopt BLIP (Li et al., 2022b) with the ViT-L/16 architecture pretrained on 129M image-text pairs. The original BLIP-ITM and BLIP-Caption models are further finetuned on the 2017 train split of COCO Captions (Lin et al., 2014), which partially overlaps with VQAv2 and OK-VQA. To prevent data leakage, we instead finetune on the 2014 train split of COCO Captions, which does not overlap with the VQA evaluation datasets. We emphasize that this represents less, not more, training compared to the publicly released BLIP.
For the question answering module, we adopt UnifiedQAv2 (Khashabi et al., 2022) trained on diverse textual QA datasets. It is worth noting that UnifiedQAv2 is completely unaware of the visual modality during training. Therefore, its training data do not overlap with the VQA datasets.
Unless otherwise stated, we utilize a total of 100 captions per question. We select the 8th cross-attention layer of the ITE network for GradCAM. We sample K′ = 20 image patches for the generation of each caption, and use k = 50 for top-k decoding (see Fig. 9 in Appendix B). For VQAv2 and OK-VQA, we apply FiD and encode the question with one caption at a time. However, for GQA, we encode each question with a group of 5 captions. GQA requires compositional visual reasoning and thus benefits from more contextual information per question. We perform experiments using LAVIS (Li et al., 2022a) on 8 Nvidia A100 GPUs.

Comparison with the State of the Art
We compare with state-of-the-art methods that formulate zero-shot VQA as open-ended answer generation. We categorize the methods based on how the pretrained networks are conjoined. In the first group, including VL-T5 no-vqa (Cho et al., 2021), FewVLM (Jin et al., 2022), VLKD (Dai et al., 2022), Flamingo (Alayrac et al., 2022), and Frozen (Tsimpoukelli et al., 2021), a vision encoder (VE) embeds the image as a dense matrix and feeds it to the pretrained language model (PLM). After that, the system performs a round of end-to-end vision-language (VL) training on tasks other than VQA, such as image captioning. VL-T5 no-vqa and FewVLM freeze the VE and finetune the PLM, whereas Frozen freezes the PLM and trains the VE. VLKD finetunes both the PLM and part of the VE. Flamingo partially finetunes both the VE and the PLM. In the second group, the two foundation models are not jointly trained. Instead, they use language in the form of captions as the intermediate representation for an image. This group includes PICa (Yang et al., 2022) and our proposed model, PNP-VQA.
Table 2 shows the results. PNP-VQA outperforms previous methods by large margins on VQAv2 and GQA. On VQAv2 test-dev, PNP-VQA 11B outperforms the second-best technique, Flamingo 80B (Alayrac et al., 2022), by 8.5%. PNP-VQA 3B outperforms Flamingo 80B by 7.2% despite its significantly smaller size, and outperforms the similar-sized Flamingo 3B by 14.3%. On GQA, PNP-VQA large outperforms FewVLM large by 9.1% with a similar-sized PLM, despite the lack of end-to-end training. Only on OK-VQA does Flamingo perform better than PNP-VQA. OK-VQA requires external knowledge not present in the images and cannot be solved by good captions alone. We hypothesize that Flamingo's end-to-end training on its gigantic vision-language dataset induces a mapping between images and knowledge concepts that helps with OK-VQA. However, PNP-VQA is still better on OK-VQA than all other baselines that are not trained on the gigantic Flamingo data. Compared with the language-conjoined PICa (Yang et al., 2022) with 175B parameters, PNP-VQA 11B achieves a sizable improvement of 18.2%.
The results underscore the difficulty of zero-shot VQA using language models without any vision-language (VL) training. PICa, with its 175B-parameter language model, achieves performance comparable to FewVLM large, whose language model is 236x smaller but finetuned on VL data. On the other hand, finetuning a billion-scale language model could incur heavy computational cost and risk catastrophic forgetting (Tsimpoukelli et al., 2021; Alayrac et al., 2022). PNP-VQA demonstrates the feasibility of a different paradigm: using billion-scale pretrained language models for VQA with zero training.

Are PNP-VQA captions informative?
Intuitively, if the captions contain the correct answer, the QA model would have a higher chance to answer correctly. To measure the utility of captions, we compute the answer hit rate (AHR), or the proportion of questions for which at least one caption contains the ground-truth answer verbatim.
Here we exclude questions with yes/no answers, as the meaning of "yes" and "no" can be contextual and these two words appear rarely in captions.
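AHR is straightforward to compute; a sketch under the exclusion of yes/no questions (exact string containment, no answer normalization) follows.

```python
def answer_hit_rate(examples):
    """Fraction of questions whose ground-truth answer appears verbatim in at
    least one generated caption; yes/no questions are excluded.

    examples: iterable of (answer, captions) pairs.
    """
    kept, hits = 0, 0
    for answer, captions in examples:
        ans = answer.strip().lower()
        if ans in {"yes", "no"}:
            continue
        kept += 1
        hits += any(ans in cap.lower() for cap in captions)
    return hits / max(kept, 1)
```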
Figure 5(a) shows the correlation between the AHR and VQA accuracy, computed over the VQAv2 validation set, for three techniques of image patch sampling: question-guided sampling, uniform random sampling, and all patches. We observe that, within each sampling method, the VQA accuracy increases as the AHR increases. This corroborates our hypothesis that the presence of the answer in the captions facilitates the generation of the correct answer.
The correlation between performance and AHR is not perfect, as AHR does not capture other factors that may affect the answer accuracy, such as the position of the answer in the sentence and the number of its occurrences. However, AHR provides an easy-to-compute and useful measure of the information quality of the captions.
Figure 5(b) shows how AHR changes with the number of captions. Among the three techniques, question-guided sampling produces captions with the highest AHR. Thus, we may attribute the good performance of PNP-VQA partially to its informative, question-guided captions that directly contain the correct answer. Further, as the number of captions increases from 20 to 100, question-guided AHR increases from 71.8% to 84.0%. This demonstrates the benefit of Fusion-in-Decoder, which allows PNP-VQA to utilize up to 100 captions.

How sensitive is PNP-VQA to the caption decoding method?
As the content of captions plays a crucial role in the performance of PNP-VQA, we investigate its sensitivity to the choice of caption decoding method. We test four methods: the deterministic beam search and three stochastic methods, namely temperature sampling (Ficler and Goldberg, 2017; Caccia et al., 2020), nucleus sampling (Holtzman et al., 2020), and top-k sampling (Fan et al., 2018). We generate 100 captions with each method and report the results in Table 3. PNP-VQA performs very similarly across the stochastic decoding methods, but beam search results in a noticeable drop. Upon close inspection, we observe that beam search generates repetitive captions that do not sufficiently cover different aspects of the image.
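The four decoding methods can be expressed as HuggingFace-style generate keyword arguments. Only k = 50 for top-k sampling is specified in this paper; the temperature and nucleus thresholds below are illustrative placeholders.

```python
# Decoding configurations compared in Table 3. Only top_k=50 is taken from the
# paper; the other hyperparameter values are illustrative assumptions.
decoding_configs = {
    "beam search":          dict(do_sample=False, num_beams=5),
    "temperature sampling": dict(do_sample=True, temperature=0.7, top_k=0),
    "nucleus sampling":     dict(do_sample=True, top_p=0.9, top_k=0),
    "top-k sampling":       dict(do_sample=True, top_k=50),
}

# e.g. ids = caption_model.generate(**inputs, **decoding_configs["top-k sampling"])
```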

Can PNP-VQA work with other textual QA models?
We experiment with two other PLMs as the question answering module for PNP-VQA: T0 (Sanh et al., 2022) and GPT-J (Wang and Komatsuzaki, 2021). T0 is an encoder-decoder model which is pretrained in a multi-task fashion on a collection of NLP tasks, including question answering. GPT-J is a decoder-only model, a much smaller open-source alternative to GPT-3 (Brown et al., 2020), which is pretrained with a task-agnostic language modeling loss on a large-scale text corpus. Table 4 shows that UnifiedQAv2 performs better on VQA tasks compared to T0 and GPT-J. We attribute UnifiedQAv2's good performance to the fact that it is a task-specific question answering model with superior textual QA performance. The result indicates that the choice of PLM is important when performing zero-shot VQA with zero training. The modular and flexible design of PNP-VQA leaves room for further performance improvements as more advanced PLMs emerge.
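Because the QA module is coupled to the rest of the pipeline only through text, swapping it amounts to loading a different checkpoint. A sketch with the transformers library follows; the checkpoint identifiers are assumptions and may differ from the ones actually used, and GPT-J, being decoder-only, would instead be loaded with AutoModelForCausalLM and prompted.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Candidate encoder-decoder QA modules; checkpoint names are assumptions.
qa_checkpoints = {
    "unifiedqa-v2": "allenai/unifiedqa-v2-t5-large-1363200",
    "t0":           "bigscience/T0_3B",
}

name = qa_checkpoints["unifiedqa-v2"]
tokenizer = AutoTokenizer.from_pretrained(name)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(name)
# The patch selection, captioning, and FiD fusion steps remain unchanged.
```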

Conclusion
We propose PNP-VQA, a framework with zero additional training for zero-shot VQA by conjoining off-the-shelf pretrained models. PNP-VQA leverages an image-question matching module to determine image patches relevant to the current question. An image captioning module then generates question-guided captions, which are processed by a question answering module to produce an answer. PNP-VQA achieves state-of-the-art performance on multiple VQA benchmarks. We hope that our work will inspire further research into flexible, modular AI systems for solving vision-language tasks.

Limitations
Like two sides of the same coin, the strengths and weaknesses of PNP-VQA both result from its zero-training modular system design. PNP-VQA enjoys the power of pretrained models but also inherits the biases of these models. It enjoys the efficiency of zero training, but introduces additional inference cost due to the multi-step process. Nevertheless, we believe that the strengths of PNP-VQA outweigh its limitations, and we welcome further investigations to help debias pretrained models and improve inference speed.

A Visualization
In the appendix, we show visualizations of GradCAM heatmaps and the generated captions for VQAv2, OK-VQA, and GQA on the following pages.

B Hyperparameter sensitivity
We study how VQAv2 validation accuracy varies with the cross-attention layer used for GradCAM and the number of image patches sampled for question-guided caption generation. Figure 9(a) shows no clear relationship between VQA accuracy and the cross-attention layer used for GradCAM. The maximum difference in VQA accuracy across different cross-attention layers is 3%. Figure 9(b) shows that VQA accuracy has a negative correlation with the number of sampled image patches. As K′ increases, the sampled patches become less relevant to the questions, and question-guided patch sampling becomes akin to using all patches.

Figure 1: The system architecture of PNP-VQA, consisting of three pretrained modules: (1) an image-question matching module that identifies image patches relevant to the question, (2) an image captioning module that generates a diverse set of captions, (3) a question answering module that generates an answer given the question and captions. For the image-question matching module and image captioning module, we adopt BLIP (Li et al., 2022b). For the question answering module, we adopt UnifiedQAv2 (Khashabi et al., 2022).
Figure 2: Examples of generic captions (from all patches) based on the original image and question-guided captions (from the sampled patches) based on the GradCAM heatmaps on VQAv2 data. For illustrative purposes, we highlight words in green to indicate correct answer predictions and the cues from captions. Words in red indicate wrong answer predictions.

Figure 3: Two methods to process multiple captions with a question answering model. (a) Fusion-in-Encoder (FiE), which concatenates the captions as a long input paragraph to the encoder. (b) Fusion-in-Decoder (FiD), which encodes each caption with the question individually and concatenates all encoded representations as input to the cross-attention mechanism of the decoder.

Figure 5: Analysis of the relationships between answer hit rate (AHR), VQA accuracy, and the number of captions per question (N). (a) shows a positive correlation between AHR and VQA accuracy. (b) shows that AHR increases with N, where the proposed question-guided patch sampling produces captions with the highest AHR.

Figure 6: Examples from VQAv2. We show generic captions (from all patches) based on the original image and question-guided captions (from the sampled patches) based on the GradCAM heatmaps. For illustrative purposes, we highlight words in green to indicate correct answer predictions and the cues in captions. Words in red indicate wrong answer predictions.

Figure 7: Examples from OK-VQA. We show generic captions (from all patches) based on the original image and question-guided captions (from the sampled patches) based on the GradCAM heatmaps. For illustrative purposes, we highlight words in green to indicate correct answer predictions and the cues in captions. Words in red indicate wrong answer predictions.

Table 1: Comparison of different sampling strategies for image patches. 100 question-guided captions surpass the performance of 5 human-written captions from MS COCO.

Table 2: Comparison with state-of-the-art models on zero-shot VQA. Flamingo (Alayrac et al., 2022) inserts additional parameters into the language model and performs training using billion-scale vision-language data. The best accuracy is bolded and the second best is underlined.

Table 3: Ablation study on different caption decoding methods. PNP-VQA 3B performs well across the stochastic methods.

Table 4: Ablation study on different textual question answering modules for PNP-VQA on zero-shot VQA. UnifiedQAv2 is a task-specific model pretrained for question answering.