Zero-shot Visual Question Answering with Language Model Feedback

In this paper, we propose LAMOC, a novel language-model-guided captioning approach for knowledge-based visual question answering (VQA). Our approach employs the captions generated by a captioning model as the context of an answer prediction model, which is a pre-trained language model (PLM). As the major contribution, we leverage the guidance and feedback of the prediction model to improve the capability of the captioning model. In this way, the captioning model can become aware of the task goal and the information need of the PLM. To develop our approach, we design two specific training stages: the first stage adapts the captioning model to the prediction model (selecting more suitable caption propositions for training), and the second stage tunes the captioning model according to the task goal (learning from the feedback of the PLM). Extensive experiments demonstrate the effectiveness of the proposed approach on the knowledge-based VQA task. Specifically, on the challenging A-OKVQA dataset, LAMOC outperforms several competitive zero-shot methods and even achieves comparable results to a fine-tuned VLP model. Our code is publicly available at https://github.com/RUCAIBox/LAMOC.


Introduction
Recently, pre-trained language models (PLMs) (Devlin et al., 2019; Brown et al., 2020), especially large language models (Zhao et al., 2023), have demonstrated excellent capabilities in solving tasks that require background knowledge or complex reasoning, such as commonsense reasoning (Sap et al., 2019; Rajani et al., 2019) and logical reasoning (Wei et al., 2022; Kojima et al., 2022). Inspired by these successes, recent studies have proposed utilizing PLMs (in this paper, models trained on text-only corpora, as opposed to the text encoder/decoder in vision-language pre-trained (VLP) models, which typically has a weaker capacity for linguistic reasoning) to solve complex vision-language

tasks, exemplified by knowledge-based visual question answering (VQA), which aims to answer open-ended questions about an image based on outside knowledge (Schwenk et al., 2022). It has been shown that PLM-enhanced approaches (Gui et al., 2022; Lin et al., 2022) typically lead to better performance on the knowledge-based VQA task than pure vision-language pre-trained (VLP) models (Schwenk et al., 2022).
In the literature, existing PLM-enhanced VQA approaches can be roughly categorized into two lines. The first line of research focuses on adapting PLMs to the vision modality by introducing specific modular networks or training objectives (Tsimpoukelli et al., 2021; Liang et al., 2022; Alayrac et al., 2022). However, these methods usually incur a high computational cost during pre-training in order to effectively integrate a vision encoder into the PLM. As another line of research, several studies aim to reduce the cost of tuning PLMs on vision-language tasks by utilizing PLMs in a zero-shot or few-shot manner. They typically generate a caption for an image using a captioning model (e.g., a fine-tuned VLP model), and employ the generated caption as the context (e.g., prompt) to assist the PLM in question answering (Yang et al., 2022; Tiong et al., 2022; Guo et al., 2022). Such an approach is training-free and can be generally applied with various PLMs. However, in these existing zero-shot or few-shot methods, the captioning model is unaware of both the task goal and the information need of the integrated PLM: they directly reuse a captioning model fine-tuned on caption datasets. As a result, the generated captions tend to be less informative for the VQA task, or even irrelevant to the question. Figure 1 presents an example in which an inappropriate caption leads the PLM to generate an incorrect answer. As we can see, the question is highly related to the keywords "icing" and "frosting", but the captioning model misses this information and generates a generic description.
To address this issue, we propose LAMOC: a novel LAnguage MOdel guided Captioning approach for the VQA task. The key idea is to leverage the guidance and feedback of the prediction model (i.e., the PLM) to improve the capability of the captioning model, so that it becomes aware of the task goal and information need and can assist the prediction model in answer prediction. Our approach is specially designed with two gradual training stages. At the first stage, the captioning model is trained to align with the prediction model: the prediction model selects the captions that are most pertinent to a given question from multiple propositions generated by the captioning model. These selected captions are informative and can be used to fine-tune the captioning model to generate informative captions. At the second stage, since the generated caption is used by the PLM as direct evidence for VQA, we employ the feedback from the PLM as reward signals to train the captioning model via reinforcement learning. During training, only the captioning model is tuned while the PLM stays fixed, which significantly reduces the computational cost. Moreover, since the feedback comes from the PLM, neither training stage requires any labeled data.
Our contributions can be summarized as follows: (1) We propose LAMOC, a novel approach for training captioning models to generate informative captions that assist PLMs on VQA tasks; (2) Using a small number of randomly sampled unlabeled (image, question) pairs, LAMOC consistently outperforms several competitive zero/few-shot baselines without PLM feedback on two knowledge-based VQA datasets: OK-VQA and A-OKVQA; (3) We demonstrate the effectiveness of our method on PLMs of varying scales, from 223M to 11B parameters. This not only confirms the robustness of our approach but also demonstrates its potential for generalization to large language models (LLMs).

Related Work
PLMs for VQA. After training on large corpora, PLMs exhibit surprising abilities, such as chain-of-thought reasoning (Wei et al., 2022), in-context learning (Brown et al., 2020), and instruction following (Chung et al., 2022), which cannot be obtained by vision-language pre-training. Thus, some works adopt PLMs to perform VQA and obtain promising results. One line of research combines a PLM and a vision encoder and trains them end-to-end. Frozen (Tsimpoukelli et al., 2021) and Liang et al. (2022) train a visual encoder or a modular network and keep the PLM frozen to retain its powerful abilities. Flamingo (Alayrac et al., 2022) elaborates the model architecture to combine the vision and language models and scales the model size to 80B. Another line of research deploys PLMs on VQA tasks in a few-shot/zero-shot manner. PICa (Yang et al., 2022) and Img2Prompt (Guo et al., 2022) translate the image into captions or tags and employ GPT-3 to answer a question via in-context learning. PNP-VQA (Tiong et al., 2022) generates question-related captions and utilizes a QA model (Khashabi et al., 2022) for answer prediction. This type of work requires no extra training and can be adapted to new PLMs. Our work follows the second paradigm and extends these works.
Learning from Feedback. The regular paradigm for training a model is defining a loss function and optimizing it. However, certain objectives, such as coherence, diversity, and toxicity in text generation, cannot easily be incorporated into the loss function and learned in an end-to-end manner (Paulus et al., 2018; Pang and He, 2021). Thus, explicit feedback on model outputs is used as a learning signal to assist training. Campos and Shern (2022) utilize a PLM's refinements and human feedback to fine-tune a summarization model. Wang et al. (2022c) leverage compiler feedback to improve the compilability of programs generated by a language model. Ouyang et al. (2022) align a language model with the user's intention through reinforcement learning from human feedback. We borrow the idea from these works, but our feedback comes from a PLM instead of humans, thus saving annotation cost.

Method
In this section, we present the proposed LAMOC: a LAnguage MOdel guided Captioning method for VQA. The overall architecture of LAMOC is depicted in Figure 2.

Overview of Our Approach
In this work, we study the task of visual question answering (VQA). Given an image-question pair x: ⟨x_i, x_q⟩, the task goal is to predict a correct answer y to the question x_q given the image x_i. Following prior studies (Yang et al., 2022; Tiong et al., 2022), we adopt a captioning-based approach to VQA, in which a captioning model generates auxiliary captions to help answer prediction. Formally, we represent this idea in a probabilistic way:

p(y | x_i, x_q) = Σ_z p(z | x_i; Θ_C) · p(y | x_q, z; Θ_P), (1)

where a captioning model Θ_C first generates an auxiliary caption z, and a prediction model Θ_P then predicts an answer candidate y based on the caption z and the question x_q. We evaluate this probability by iterating over a set of generated captions. Here, we consider an unsupervised setting: no labeled answer data is available. Although there are no labeled answers, we assume that a small number of image-question pairs can be obtained for training (with no overlap with the task dataset).
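The marginalization over captions in Eq. (1) can be sketched as follows. The probability tables are toy numbers standing in for the captioning model and the PLM (all names and values here are illustrative, not the authors' implementation):

```python
from collections import defaultdict

def answer_distribution(caption_probs, answer_likelihoods):
    """Approximate p(y | x_i, x_q) = sum_z p(z | x_i; Theta_C) * p(y | x_q, z; Theta_P).

    caption_probs: dict mapping caption z -> p(z | x_i; Theta_C)
    answer_likelihoods: dict mapping caption z -> {answer y: p(y | x_q, z; Theta_P)}
    """
    scores = defaultdict(float)
    for z, p_z in caption_probs.items():
        for y, p_y in answer_likelihoods[z].items():
            scores[y] += p_z * p_y  # accumulate the marginal over captions
    return dict(scores)

# Toy example: two sampled captions, one of which mentions the key detail.
captions = {"cupcakes on a table": 0.6, "cupcakes topped with white frosting": 0.4}
likelihoods = {
    "cupcakes on a table": {"biscuit": 0.5, "frosting": 0.1},
    "cupcakes topped with white frosting": {"biscuit": 0.1, "frosting": 0.8},
}
dist = answer_distribution(captions, likelihoods)
best = max(dist, key=dist.get)  # -> "frosting"
```

The informative caption pulls the marginal toward the correct answer even though the generic caption has higher prior probability, which is exactly the behavior the training stages below try to encourage.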
To instantiate this probabilistic approach, we adopt a vision-language pre-trained (VLP) model, BLIP (Li et al., 2022b), as the captioning model, and a pre-trained language model (PLM), FLAN-T5-XXL (Chung et al., 2022), as the prediction model. The prediction model Θ_P is expected to fulfill the task by accurately predicting the answer, while the captioning model Θ_C plays an assisting role by providing informative evidence for Θ_P. In our approach, the captioning model Θ_C can be tuned while the prediction model Θ_P stays fixed during optimization. By leveraging the unlabeled image-question pairs (without labeled answers), we let the two models cooperate with each other: the captioning model generates informative evidence to help answer prediction, and the prediction model provides task-specific guidance and feedback to improve the captioning model.
To optimize our approach, we design a gradual training process with two stages: (1) captioning adaptation adjusts Θ_C to produce informative captions that are suitable for Θ_P (§3.2.1), and (2) feedback-based learning optimizes Θ_C according to task-specific feedback from Θ_P (§3.2.2). Once the captioning model is well trained, we employ the prediction model to predict the final answer as in Eq. (1), based on the captions provided by the captioning model (§3.3). Next, we introduce these parts in detail.

Language Model Guided Captioning
The key to our approach (Eq. (1)) is to train an effective captioning model Θ_C that improves the capability of the prediction model Θ_P on VQA. Since there are no labeled answers, we employ the prediction model to provide guidance and feedback for optimizing the captioning model.

Captioning Adaptation
Since the captioning model is originally intended to describe the given image, it may not be well suited to assisting the prediction model. Thus, we propose a captioning adaptation strategy that tunes the captioning model to fit the prediction model.
Caption Propositions. We first sample n image-question pairs from VQAv2 (Goyal et al., 2017), a large VQA dataset containing more than 1M questions that does not overlap with our task datasets. We then employ the captioning model to propose k captions for each image via nucleus sampling (Holtzman et al., 2019). Among these captions, some may be better suited for the prediction model than the rest. We would like to identify such captions and use them to refine the captioning model.

Filter-based Caption Selection. Since the prediction model is built on FLAN-T5-XXL, it encodes a large amount of knowledge in its massive number of parameters. We design the following instruction to prompt FLAN-T5-XXL to identify the more informative captions: "Question: [QUESTION] Caption: [CAPTION]\n To what degree does the caption relate to the question:\n A: 0%\n B: 25%\n C: 50%\n D: 75%". Given this prompt, FLAN-T5-XXL generates an option from the set {A, B, C, D}. The option reflects the correlation between the caption and the question: captions with the predicted option "D: 75%" are the most relevant to the question. Since these judgments are made by the prediction model itself, such captions tend to be more useful for answer prediction. Thus, we keep the captions with the predicted option "D: 75%" and discard the rest.
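The option-based filtering step can be sketched as a simple post-processing routine over the PLM's judgments. The letter-to-score mapping and the example captions are illustrative assumptions, not the paper's exact values:

```python
# Relevance options offered in the prompt (the mapping to scalars is assumed
# here for illustration; only the letter "D" is used for filtering).
RELEVANCE = {"A": 0.0, "B": 0.25, "C": 0.5, "D": 0.75}

def filter_captions(captions_with_options, keep_option="D"):
    """Keep only the captions the PLM judged most relevant to the question.

    captions_with_options: list of (caption, option_letter) pairs, where the
    option letter is parsed from the PLM's answer to the relevance prompt.
    """
    return [c for c, opt in captions_with_options if opt == keep_option]

# Toy example: three sampled captions with PLM-assigned relevance options.
judged = [
    ("a person taking a picture of cupcakes", "B"),
    ("cupcakes with white frosting on top", "D"),
    ("a box on a table", "A"),
]
informative = filter_captions(judged)  # -> ["cupcakes with white frosting on top"]
```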
Captioning Model Fine-tuning. Via the above caption selection, we obtain a set of more informative captions, as judged by the prediction model. We then use them to fine-tune the captioning model by optimizing the following cross-entropy loss:

L_FT = − Σ_{t=1}^{T} log p(z_t | z_{<t}, x_i; Θ_C), (2)

where T is the length of the caption, z_t denotes the t-th token of an informative caption selected by FLAN-T5-XXL, and z_{<t} represents the tokens generated up to the (t−1)-th step. After fine-tuning, the captioning model is better suited to the prediction model.
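The token-level cross-entropy in the fine-tuning loss can be sketched with toy per-token probabilities (the values are illustrative; in practice they come from the captioning model's softmax over the gold tokens of a selected caption):

```python
import math

def caption_ce_loss(token_probs):
    """L_FT = -sum_t log p(z_t | z_<t, x_i; Theta_C).

    token_probs: the probabilities the captioning model assigns to each gold
    token of a selected informative caption, one value per step t = 1..T.
    """
    return -sum(math.log(p) for p in token_probs)

# Toy example: a 4-token caption; higher token probabilities mean lower loss.
loss = caption_ce_loss([0.9, 0.8, 0.7, 0.95])
```

Minimizing this loss raises the likelihood of exactly those captions the PLM judged informative, which is how the captioning model "adapts" to the prediction model.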

Feedback-based Learning
Though adapted to the prediction model, the captioning model is still unaware of the answer prediction task of VQA. Thus, we further propose to construct pseudo supervision signals based on feedback from the prediction model. Since the captioning model is only involved as an intermediate component in answer prediction, we design a reinforcement learning method to optimize it.
Reward From PLM Feedback. A key design consideration in reinforcement learning is the definition of the reward function. In our approach, rather than merely generating captions relevant to the image, the captioning model should be measured by how well it helps find the correct answer. To achieve this goal, we design the following two kinds of reward signals.
• Prompt-based Reward: A heuristic method is to use the prompt in §3.2.1 to instruct FLAN-T5-XXL to produce a relevance score, which we treat as the reward signal:

r(x_q, z) = Rel(x_q, z; Θ_P), (3)

where Rel(·) denotes the relevance score given by the prediction model. A higher score indicates a more informative caption, which is encouraged.
• Confidence-based Reward: Since there is no ground-truth answer during training, following Eq. (1), we use the probability score of the predicted answer (the most confident candidate) given by the prediction model as the reward:

r(x_q, z) = p(ŷ | x_q, z; Θ_P), (4)

where z is the caption generated by the captioning model and ŷ is the answer predicted by the prediction model. In this way, the PLM (i.e., the prediction model) informs the captioning model about the informativeness of the generated caption: the larger the probability score, the more informative the caption, and vice versa. We verify the reliability of these reward designs in §5.1.
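The confidence-based reward can be recovered from the per-token log-probabilities of the PLM's greedy answer; equivalently it is exp(−NLL), so a lower negative log-likelihood yields a larger reward. A minimal sketch with assumed toy log-probabilities:

```python
import math

def confidence_reward(answer_token_logprobs):
    """r(x_q, z) = p(y_hat | x_q, z; Theta_P).

    The PLM's probability of its own greedy answer, recovered by summing the
    per-token log-probabilities and exponentiating (i.e., exp(-NLL)).
    """
    return math.exp(sum(answer_token_logprobs))

# Toy example: a two-token answer the PLM is fairly confident about.
r = confidence_reward([math.log(0.8), math.log(0.9)])  # -> 0.72
```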
Policy Gradient. In the framework of reinforcement learning, caption generation can be viewed as a sequential decision-making process over the whole vocabulary space. Each generated caption of T tokens is treated as an individual episode of length T. At the t-th time step, the state (x_i, z_{<t}) is the combination of the image and the caption generated up to the (t−1)-th token, and the action z_t is the t-th token to be generated. We employ the policy gradient algorithm (Sutton and Barto, 2018) and perform gradient descent to optimize the following objective:

L_RL = − E_{z∼p(z|x_i; Θ_C)} [ r(x_q, z) · Σ_{t=1}^{T} log p(z_t | z_{<t}, x_i; Θ_C) ], (5)

where z = ⟨z_1, ..., z_t, ..., z_T⟩ is the caption and r(x_q, z) is the reward given by the PLM. Finally, we jointly optimize the two loss functions:

L = α · L_FT + (1 − α) · L_RL, (6)

where α is a weight factor balancing the two parts.
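The REINFORCE-style surrogate loss and the weighted combination of the two objectives can be sketched as follows. The α-weighted form of the joint loss and all numeric values are illustrative assumptions:

```python
import math

def rl_loss(reward, token_probs):
    """REINFORCE surrogate: L_RL = -r(x_q, z) * sum_t log p(z_t | z_<t, x_i).

    Minimizing this increases the log-likelihood of captions that earned a
    high reward from the PLM and decreases it for low-reward captions.
    """
    return -reward * sum(math.log(p) for p in token_probs)

def joint_loss(l_ft, l_rl, alpha=0.5):
    """One plausible weighting of the two objectives (alpha is illustrative)."""
    return alpha * l_ft + (1 - alpha) * l_rl

# Toy example: a 3-token caption sampled during training.
l_rl = rl_loss(reward=0.72, token_probs=[0.9, 0.8, 0.7])
total = joint_loss(l_ft=1.2, l_rl=l_rl, alpha=0.5)
```

Because the reward scales the whole log-likelihood term, captions the PLM found uninformative (low r) contribute little gradient, while informative ones are strongly reinforced.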
To fully exploit the online feedback provided by FLAN-T5-XXL, we only optimize the captioning adaptation loss L_FT in the initial epoch, while the reinforcement learning loss L_RL is optimized throughout the training process.

Answer Prediction
At inference time, we utilize the updated captioning model to assist the prediction model in answering questions, by calculating the probability p(y | x_q, z; Θ_P). To increase the diversity of captions and the coverage of answers, we first randomly sample 20% of the patches from the whole image each time and apply top-k sampling (Fan et al., 2018) to generate a caption for these patches with the updated captioning model. We repeat this process m times to generate m diverse captions. We then concatenate each of them with the corresponding question to construct the following prompt: "Please answer the following question.\n[CAPTION].[QUESTION]". Based on this prompt, FLAN-T5-XXL is instructed to propose an answer with greedy decoding. We then take a max-voting strategy over all the generated answers.
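The max-voting step over the m caption-conditioned answers can be sketched with a Counter (the answers are illustrative):

```python
from collections import Counter

def vote_answer(answers):
    """Max-voting over the answers produced from m diverse captions.

    Ties are broken by first occurrence, mirroring Counter.most_common.
    """
    return Counter(answers).most_common(1)[0][0]

# Toy example: answers from m = 5 caption-conditioned prompts.
final = vote_answer(["frosting", "sugar", "frosting", "icing", "frosting"])  # -> "frosting"
```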
Different from previous work on learning from feedback (Campos and Shern, 2022; Wang et al., 2022c; Ouyang et al., 2022), our proposed approach exploits the guidance and feedback of the prediction model instead of human annotations. As we will see in §5.1, our empirical study shows that there is a negative correlation between the negative log-likelihood assigned by a PLM and the VQA score of a generated answer. This finding suggests that the reward r(x_q, z) given by the PLM can serve as a substitute for labeled data in improving the captioning model for the VQA task.

Experiment
This section describes the experimental setup and then highlights the main conclusions drawn from our results.

Experimental Setup
Task Datasets. Since our goal is to improve the performance of PLMs on visual commonsense tasks, we choose two knowledge-based VQA datasets to evaluate our method: (1) OK-VQA (Marino et al., 2019) contains 5,046 test-set questions that require external knowledge resources to answer.
(2) A-OKVQA (Schwenk et al., 2022) is an augmented dataset based on OK-VQA that requires additional types of world knowledge. Since the test set of A-OKVQA is not public, we evaluate our method on the validation set. We do not test on VQAv2 (Goyal et al., 2017) because the majority of its questions focus on recognition and simple visual detection, which can be done without much logical reasoning or external knowledge, and a fine-tuned VLP model already obtains strong results (Wang et al., 2022b,a). We do not use training data, to ensure a fair comparison with other methods.
Baselines. We divide previous methods into two categories: (1) Methods without extra large-scale vision-language (V-L) pre-training, meaning the models have not been pre-trained on large-scale V-L datasets, including PICa (Yang et al., 2022), PNP-VQA (Tiong et al., 2022), and Img2Prompt (Guo et al., 2022). LAMOC also belongs to this category. (2) Methods with extra large-scale V-L pre-training, meaning the PLM and the vision encoder are jointly trained on V-L datasets (although the PLM may be fixed, it obtains the ability to understand images), including VL-T5 (Cho et al., 2021), FewVLM (Jin et al., 2022), VLKD (Dai et al., 2022), Frozen (Tsimpoukelli et al., 2021), and Flamingo (Alayrac et al., 2022). The above methods use no or few labeled examples (zero-shot/few-shot). In addition, we include two methods, BLIP (Li et al., 2022b) and PromptCap (Hu et al., 2022), which are fine-tuned on large amounts of labeled data.
Implementation details. For image captioning, we adopt BLIP (Li et al., 2022b) with 446M parameters and load the released checkpoint fine-tuned on the COCO 2014 training set (Lin et al., 2014), which has no overlap with the OK-VQA and A-OKVQA evaluation datasets. For the PLM, we utilize FLAN-T5-XXL (Chung et al., 2022), which has been fine-tuned on more than 1,800 tasks through instructions and stores considerable world knowledge. We also carry out experiments on PLMs of other sizes, from 223M to 11B parameters, to demonstrate the robustness and generalizability of our approach across PLM scales. Notably, the informative caption dataset used in the captioning adaptation stage is selected by FLAN-T5-XXL, because the relevance scores given by smaller models are unreliable, as illustrated in §5.1. When training the captioning model, we select 1,000 (image, question) pairs without labels from VQAv2 (about 10% of the amount of training data of our target datasets), which has no overlap with OK-VQA and A-OKVQA. These 1,000 image-question pairs could be sampled from any dataset or even be generated; we sample from VQAv2 for the sake of reproducibility. The answers are generated by the PLM auto-regressively, without access to a pre-defined answer list. We conduct experiments with 5 random seeds and report the average VQA score according to the official evaluation protocols.

Main Results
Table 1 displays the results of our method and the baselines on OK-VQA and A-OKVQA. First, LAMOC outperforms all zero-shot baselines without V-L pre-training on both datasets. Compared to the previous state of the art, LAMOC achieves prominent gains on the challenging A-OKVQA dataset (37.9 vs. 36.0) and on OK-VQA (40.3 vs. 39.9). Compared to these baselines, our approach does not require additional image-question matching or question generation modules, thus speeding up inference. Since Flamingo has been trained on a massive V-L dataset, it achieves the best performance among zero-shot methods; it has been reported that large-scale V-L pre-training can develop a mapping between images and knowledge concepts that aids knowledge-based VQA (Tiong et al., 2022).
Second, LAMOC narrows the gap between methods with and without fine-tuning, and even achieves results comparable to the fine-tuned VLP model, BLIP. For example, the performance gap between PNP-VQA 11B and BLIP is 2.5, which LAMOC decreases to 0.6, underscoring the importance of language model feedback.
Finally, we report the results of our method with different model sizes in Table 2. When increasing the model scale from 223M to 11B, we observe a 1-2 point improvement in VQA scores on the challenging A-OKVQA dataset. This indicates that a larger PLM can not only store more world knowledge to assist question answering, but also provide more accurate feedback to refine the captioning model. This is further supported by the ablation study in §5.1.

The main idea of our work is leveraging the feedback of a PLM to guide caption generation, so a critical aspect is the reliability of that feedback. LAMOC involves two types of feedback, (1) the prompt-based reward and (2) the confidence-based reward, which we evaluate independently.

Analysis
To evaluate the reliability of the first type of feedback, we analyze the relation between the VQA score and the relevance score provided by the PLM on the A-OKVQA validation set (Figure 3(a)). We observe that as the relevance score provided by FLAN-T5-XXL increases, the VQA score also increases, indicating that FLAN-T5-XXL is a suitable prediction model for providing accurate feedback and that its relevance scores can be regarded as reward signals. However, this trend is not observed for the other three models, implying that their feedback is unreliable. As a result, we only use FLAN-T5-XXL to select informative captions during captioning adaptation.
To evaluate the reliability of the second type of feedback, we prompt FLAN-T5 to answer the question conditioned on the captions and plot the relationship between the negative log-likelihood (NLL) of the generated answer and its corresponding VQA score. As Figure 3(b) shows, there is a negative correlation between the NLL of the generated answers and their VQA scores, suggesting that captions with lower NLL are more informative and relevant to the questions. Therefore, the probability of the generated answer is reliable feedback and can be used as the reward signal during reinforcement learning.

The Effectiveness of Two-stage Training
When training the captioning model, we adopt two gradual training stages: captioning adaptation and feedback-based learning. In this part, we study the effectiveness of this training strategy and explore whether one stage is more effective than the other. As illustrated in Table 2, different models benefit from different training objectives. For example, the captioning adaptation stage is more beneficial for FLAN-T5-large, leading to an improvement of about 4 points on OK-VQA. On the other hand, FLAN-T5-XXL benefits the most from reinforcement learning with prompt-based rewards, obtaining an improvement of more than 4 points on A-OKVQA. Moreover, the results show that jointly training the two objectives further boosts performance, highlighting the effectiveness of the proposed two-stage training approach.

One concern is that the PLM may generate correct answers due to language bias rather than the relevant information contained in the captions. For example, in Figure 4(a), the PLM may generate the answer "chocolate" even if the captions do not mention chocolate, because chocolate often co-occurs with donuts in the training corpora and the PLM may thus associate the two (Li et al., 2023). To check how often this happens, we randomly sample 100 questions for which the prediction model gives correct answers. For each question, we manually assess whether the answer is derived from the caption. Our analysis reveals that only 6 out of 100 captions are irrelevant to the questions, indicating the reliability of the captions.

Case Study
Another interesting phenomenon is that the sentences generated by LAMOC can be grammatically incoherent and sometimes incomplete. This indicates that PLM prompting may not always conform to human language patterns, which is consistent with previous studies (Webson and Pavlick, 2022; Deng et al., 2022).
Ablation studies on the level of relevance, the number of captions, and the influence of different prompt designs can be found in Appendix B.

Conclusion
In this paper, we propose LAMOC, a language-model-guided captioning method that improves a captioning model so that it generates comprehensive captions for an image to help answer the question. To train such a model, we first perform captioning adaptation on a self-generated dataset filtered by FLAN-T5-XXL, and then fine-tune the updated captioning model through reinforcement learning from PLM feedback. Our method generates captions that are both informative and able to assist PLMs on VQA tasks, as demonstrated through experiments on two knowledge-based VQA datasets. On the challenging A-OKVQA dataset, LAMOC substantially outperforms previous zero-shot methods and even achieves results comparable to a fine-tuned VLP model. Additionally, we show that LAMOC generalizes to PLMs of varying sizes, from 223M to 11B parameters, demonstrating its potential to be applied to LLMs, which we leave as future work.

Limitations
In our study, we have demonstrated the effectiveness of the proposed method on FLAN-T5 models of different sizes. However, we have not yet evaluated its performance on LLMs, which possess an even greater number of parameters and have been pre-trained on larger corpora, and could thus potentially provide more accurate feedback for both captioning adaptation and reinforcement learning. Meanwhile, it is worth noting that PLMs may contain certain biases, and training based on their feedback may amplify these biases. As future work, we aim to investigate the scalability of our method to LLMs, as well as strategies to mitigate the potential negative effects of biases present in PLMs.

B.3 Prompt Design
Another critical design choice of our method is how we instruct FLAN-T5 to provide feedback and answer questions, so we explore the effects of different instruction formats in Table 4. We observe that prompt design has a great impact on the results, which is in line with the conclusions of previous work (Wei et al., 2022).

Figure 1: An example in which a captioning model (BLIP) fails to provide a suitable description for the prediction model (FLAN-T5) on a question from the A-OKVQA dataset.


Figure 2: Overview of our proposed approach LAMOC. In captioning adaptation, we utilize a PLM to select informative captions and fine-tune the captioning model on them. When learning from PLM feedback, we treat the feedback from the PLM as reward signals and perform reinforcement learning on the captioning model.

Figure 3: The relationship between a caption's reward and the corresponding answer's VQA score on the A-OKVQA validation set. Figure (a) reflects the reliability of the prompt-based reward, while Figure (b) reflects the reliability of the confidence-based reward.

Figure 4 displays three instances of captions generated by BLIP and LAMOC, along with the corresponding answers generated by FLAN-T5-XXL. Since LAMOC is trained on the basis of BLIP, the difference reflects the effect of our method. As can be observed, the captions generated by LAMOC are longer and more comprehensive, containing key information relevant to the question. For example, in Figure 4(a), LAMOC generates captions that include specific details such as "frosting" and "chocolate", while BLIP only generates general captions about "donuts" and "box", without sufficient information to help answer the question. These results highlight the importance of training the captioning model under the guidance of PLMs.

Figure 4: Example captions and predictions generated by BLIP and LAMOC.

Figure 5: VQA score with different numbers of captions on the A-OKVQA validation set.

Table 1: Results on OK-VQA and A-OKVQA. The methods are categorized by whether they use an extra PLM and whether they carry out V-L pre-training. The methods in the upper part have been fine-tuned on the training set, while those in the middle and bottom parts have not. All methods using an extra PLM keep it frozen.
† Instead of first fine-tuning BLIP on VQAv2 and then performing task-specific fine-tuning, we directly fine-tune BLIP on the two target datasets for a fair comparison.

Table 2: Results of different model sizes and different training objectives. "BLIP caption" means feeding captions generated by BLIP to the PLM without captioning adaptation or feedback-based learning. "adaptation" means captioning adaptation, "RL (prompt)" means RL with the prompt-based reward, and "RL (confidence)" means RL with the confidence-based reward.
Figure 4 example (a). Q: What kind of coating has been used? A: frosting, chocolate, icing, paint. BLIP captions: 1. there are a box of assorted donuts in it. 2. a box of seven different types of doughnuts. BLIP answer prediction: sugar ❌. LAMOC captions: 1. box of chocolate and sprinkle glazed donuts, each with assorted sprinkled sugar, vanilla, and chocolate frosting. 2. a box of assorted chocolate, pink, chocolate frosted doughnuts. LAMOC answer prediction: chocolate frosting, chocolate ✅

Figure 4 example (b). Q: What is the stuffed animal touching instead of the tennis player? A: television, television screen. BLIP captions: 1. a teddy bear in a pair of scissors and a tennis racket. 2. a stuffed teddy bear wearing a tennis uniform. BLIP answer prediction: a tennis ball ❌. LAMOC captions: 1. man in business clothes swinging tennis racquet at television player on glass background. 2. a man on television wearing the foot of a tennis player. LAMOC answer prediction: television ✅

Table 4: VQA scores of the answers generated by FLAN-T5-large conditioned on different prompts.