Analyzing Modular Approaches for Visual Question Decomposition

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. additional symbolic components. To do so, we conduct a controlled study (comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks). We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away. Additionally, ViperGPT retains much of its performance if we make prominent alterations to its selection of modules: e.g. removing or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches significantly benefit by representing subtasks with natural language, instead of code.


Introduction
End-to-end neural networks (Li et al., 2023) have been the predominant solution for vision-language tasks, like Visual Question Answering (VQA) (Goyal et al., 2017). However, these methods suffer from a lack of interpretability and generalization capabilities. Instead, modular (or neurosymbolic) approaches (Andreas et al., 2015; Johnson et al., 2017; Hu et al., 2017; Yi et al., 2018) have long been suggested as effective solutions which address both of these limitations. These methods synthesize symbolic programs that are easily interpretable and can be executed (leveraging distinct image or language processing modules) to solve the task at hand. The most recent such models (Gupta and Kembhavi, 2023; Surís et al., 2023; Subramanian et al., 2023) are training-free: they leverage large language models (LLMs) to generate programs and subsume powerful neural networks as modules. Such approaches demonstrate strong results and outperform end-to-end neural networks on zero-shot vision-language tasks.
These recent modular approaches typically include state-of-the-art end-to-end networks among a complex schema of other modules and engineering designs. As a result, the contribution of these networks is difficult to disentangle from the modularity of the overall system. Thus, in this paper, we analyze ViperGPT, in which BLIP-2 (Li et al., 2023) is a constituent module, as a representative example of a recent and performant modular system for vision-language tasks. BLIP-2 in particular (and in contrast to ViperGPT's other modules) can solve VQA tasks on its own. We ask: where does ViperGPT's additional performance come from, and how much is due to the underlying BLIP-2 model vs. the additional symbolic components? To answer these research questions, we conduct a controlled study, comparing end-to-end, modular, and prompting-based approaches (Sec. 2) on several VQA benchmarks (Sec. 3). We make the following specific contributions:
1. In Section 4, we find that removing the BLIP-2 module retains a significant percentage of ViperGPT's task-agnostic performance (i.e. 84% for the direct answer setting and 87% for the multiple choice setting). And, retaining only the BLIP-2 module comprises 95% and 122% of ViperGPT's task-agnostic performance in those settings.
2. In Section 5, we find that a prompting-based (rather than code-based) method for question decomposition still constitutes 92% of the performance of an equivalent ViperGPT variant for direct answer benchmarks. Moreover, this method actually exceeds ViperGPT by +12% on average for multiple choice benchmarks. (To the best of our knowledge, our prompting-based method also presents the highest score in the multiple choice setting of A-OKVQA (Schwenk et al., 2022) compared to any other training-free method.) These results suggest that, on some benchmarks, modular approaches significantly benefit by representing subtasks with natural language, instead of code.
3. In Section 6, we explore ViperGPT's generalization to out-of-domain benchmarks. Unlike for in-domain datasets, we find that providing task-specific in-context examples actually leads to a performance drop of 11% for A-OKVQA's direct answer setting and 2% on average for the A-OKVQA and ScienceQA multiple choice settings. We additionally analyze the code that is generated by ViperGPT and observe higher runtime error rates for A-OKVQA's direct answer setting (3%) and the multiple choice benchmarks (12-18%) than for the in-domain direct answer benchmarks (1-2%). Finally, while the syntax error rate is 0% for all direct answer benchmarks, it is 1-3% for multiple choice benchmarks.

Models
In this section, we share specific design decisions and implementation details for each of the three model families we assess in this paper (i.e. end-to-end, modular, and prompting-based). We additionally visualize these approaches in Fig. 1.

End-to-end
As the end-to-end model in our analyses, we use BLIP-2 (Li et al., 2023), an open-source state-of-the-art vision-language model which can be used for image captioning and zero-shot VQA. For VQA, this model first encodes images using a pre-trained image encoder and projects the resulting encoding into the input space of a pre-trained language model. That language model is then used to generate a textual prediction, given the aforementioned image projection and VQA question.
As in Li et al. (2023), we prompt this model with "Question: {} Short answer: []".For the direct answer setting, we generate text directly from the language model.For the multiple choice setting, we select the choice with the maximum log likelihood for text generation.

Modular
In this paper, we use ViperGPT (Surís et al., 2023), a recent modular system for vision-language tasks.
ViperGPT prompts a language model with a VQA question and an API (an interface for manipulating images) to generate a Python program. This API is written in code and describes a class ImagePatch and several functions, like .find, .simple_query, etc. These functions invoke both symbolic algorithms (e.g. for iterating through elements, sorting lists, computing Euclidean distances, etc.) and trained neural network modules (for object detection, VQA, LLM queries, etc.). When the Python program is executed, these functions manipulate the VQA image and answer the question.
As a simple example, the question "How many black cats are in the image?" might be written as a short program that finds all cats and counts those verified to be black. The program can then be executed using the Python interpreter and several modules. In this case, ImagePatch.find(object_name: str) -> list[ImagePatch] utilizes an object detection module to find all cats and ImagePatch.verify_property(object_name: str, property: str) -> bool utilizes text-image similarity for determining whether those cats are black.
Implementation. Our implementation of ViperGPT uses the code released with the original ViperGPT paper (Surís et al., 2023), available at https://github.com/cvlab-columbia/viper. However, as that codebase currently differs from the original paper in several ways, we have modified it to re-align it with the original report. Specifically, we switch the code-generation model from ChatGPT to Codex, revert the module set and prompt text to those in Surís et al. (2023, Appendix A-B), and otherwise make minor corrections to the behavior and execution of ImagePatch and execute_command. However, we find that only the full API prompt is made available (not the task-specific prompts), preventing us from exactly replicating Surís et al. (2023).
For fairness in comparing with the model in Sec. 2.3, we make a few additional design decisions that deviate from Surís et al. (2023). For our task-agnostic variant (Sec. 4), we use the full ImagePatch API and external functions (excluding the VideoSegment class) in our prompt. We specify further modifications for our other variants in Sec. 4. We prompt the model with the following signature: "def execute_command(image) -> str:" (i.e. we explicitly add "-> str" to better conform to the VQA task). During multiple choice, Surís et al. (2023) provide another argument, possible_choices, in the signature. We extend this by populating the argument with an explicit list of these choices.
For the direct answer setting, we use the text returned by the program as our predicted answer. This text may be generated by a number of modules (e.g. BLIP-2 or InstructGPT) or as a hardcoded string in the program itself. For the multiple choice setting, the returned text is not guaranteed to match a choice in the provided list. So, we map the returned text to the nearest choice by prompting InstructGPT (text-davinci-003) with "Choices: {} Candidate: {} Most similar choice: []". We select the choice with the highest log likelihood.
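The mapping step can be sketched as follows. The overlap-based scorer is a toy, deterministic stand-in for the LLM log likelihood of each choice as the "Most similar choice:" continuation; the real system scores each choice with InstructGPT and takes the argmax:

```python
def overlap_score(candidate: str, choice: str) -> float:
    # Toy stand-in for the LLM's log likelihood of `choice` given
    # "Choices: {} Candidate: {} Most similar choice: []": here we
    # just measure word overlap with the candidate answer.
    cand, ch = set(candidate.lower().split()), set(choice.lower().split())
    return len(cand & ch) / max(len(ch), 1)

def nearest_choice(candidate: str, choices: list) -> str:
    # Argmax over choices, mirroring "select the choice with the
    # highest log likelihood" (with the toy score in place of the LLM).
    return max(choices, key=lambda c: overlap_score(candidate, c))

nearest_choice("a black cat", ["white dog", "black cat", "red bus"])
# -> "black cat"
```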
We elaborate further on our design choices and how they make our experiments more fair in Appendix B.

Successive Prompting
Building programs is not the only way to decompose problems. Recent work in NLP has found that large language models improve performance on reasoning tasks when solving problems step-by-step (Wei et al., 2022; Kojima et al., 2022). For question answering, decomposing a question and answering one sub-question at a time leads to further improvements (Press et al., 2022; Zhou et al., 2023; Dua et al., 2022; Khot et al., 2023). Moreover, recent work has started to invoke vision-language models based on the outputs of language models.
As a convergence of these directions, we introduce a training-free method that jointly and successively prompts an LLM (InstructGPT: text-davinci-002) and a VLM (BLIP-2) to decompose visual questions in natural language. We call this "Successive Prompting" (following Dua et al. (2022)). At each step, our method uses the LLM to ask one follow-up question at a time. Each follow-up question is answered independently by the vision-language model. In the subsequent step, the LLM uses all prior follow-up questions and answers to generate the next follow-up question. After some number of steps (as decided by the LLM), the LLM should stop proposing follow-up questions and will instead provide an answer to the original question. We constrain the LLM's behavior by appending the more likely prefix of "Follow-up:" or "Answer to the original question:" (i.e. the stopping criterion) to the prompt at the end of each step.
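The loop just described can be sketched as follows. ToyLLM and the lambda VLM are deterministic stand-ins for InstructGPT and BLIP-2, scripted for a single example, and the prompt templates are simplified assumptions rather than our exact prompts:

```python
def successive_prompting(question, image, llm, vlm, max_steps=5):
    # Transcript accumulates the original question plus all
    # follow-up question/answer pairs so far.
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Stopping criterion: append whichever prefix the LLM deems
        # more likely, "Follow-up:" or "Answer to the original question:".
        prefix = llm.choose_prefix(
            transcript, ["Follow-up:", "Answer to the original question:"])
        if prefix == "Answer to the original question:":
            return llm.complete(transcript + prefix)
        # The LLM proposes one follow-up question at a time...
        follow_up = llm.complete(transcript + prefix)
        # ...and the VLM answers each follow-up independently.
        answer = vlm(image, follow_up)
        transcript += f"Follow-up: {follow_up}\nFollow-up answer: {answer}\n"
    return llm.complete(transcript + "Answer to the original question:")

class ToyLLM:
    """Scripted stand-in for the LLM: asks one follow-up, then stops."""
    def __init__(self):
        self.asked = False

    def choose_prefix(self, transcript, prefixes):
        return prefixes[0] if not self.asked else prefixes[1]

    def complete(self, prompt):
        if prompt.endswith("Follow-up:"):
            self.asked = True
            return "Has the lettuce been fried?"
        return "no"

# Toy VLM: answers every follow-up with "no".
successive_prompting("Has the food this woman is preparing been fried?",
                     None, ToyLLM(), lambda image, q: "no")  # -> "no"
```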
In order to prompt a large language model for this task, we provide a high-level instruction along with three dataset-specific in-context demonstrations of visual question decompositions.
Our method generates text directly, which can be used for the direct answer setting.Like with ViperGPT, we also prompt our method with an explicit list of choices during the multiple choice setting.And, for the multiple choice setting, we select the choice with the highest log likelihood as the predicted answer.

Evaluation
We evaluate variants of end-to-end, modular, and prompting-based methods (Sec. 2) on a set of VQA benchmarks (Sec. 3.1) using direct answer and multiple choice evaluation metrics (Secs. 3.2 and 3.3).
These datasets vary in the amount of perception, compositionality, knowledge, and reasoning their problems require. More specifically: VQAv2 is a longstanding benchmark whose questions require primitive computer vision skills (e.g. classification, counting, etc.). GQA focuses on compositional questions and various reasoning skills. OK-VQA requires "outside knowledge" about many categories of objects and usually entails detecting an object and asking for knowledge about that object. A-OKVQA features "open-domain" questions that might also require some kind of commonsense, visual, or physical reasoning. ScienceQA features scientific questions (of elementary through high school difficulty) that require both background knowledge and multiple steps of reasoning to solve. We elaborate further in Appendix D.

Metrics: Direct Answer
We evaluate the direct answer setting for VQAv2, GQA, OK-VQA, and A-OKVQA. In this setting, a method must predict a textual answer given an image and question. We report scores using (1) the existing metric for each dataset and (2) the new InstructGPT-eval metric from Kamalloo et al. (2023).
We observe that while the general trends (determining which models perform better or worse) remain the same between metrics (1) and (2), the actual gap may differ significantly. See Appendix A for further discussion of why (2) is a more robust measure of model performance for our experiments. We include (1) for posterity in Tables 1 and 2, but make comparisons in our text using (2), unless specified otherwise.
(2) InstructGPT-eval. Kamalloo et al. (2023) find that lexical matching metrics for open-domain question answering tasks in NLP perform poorly for predictions generated by large language models. We make similar observations for the existing direct answer metrics in the VQA datasets we benchmark: such scores correlate quite poorly with our intuitions for open-ended text generated by language models. For example, the prediction "riding a horse" would be marked incorrect when the ground truth answers are variants of "horseback riding". Instead, Kamalloo et al. (2023) suggest an evaluation metric (InstructGPT-eval) that prompts InstructGPT (text-davinci-003) (Ouyang et al., 2022) to judge whether a candidate answer is correct given the list of gold answers. Kamalloo et al. (2023) demonstrate a substantial increase of +0.52 in Kendall's τ correlation with human judgements when using their introduced metric instead of exact match on the NQ-Open benchmark (Lee et al., 2019).
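The lexical failure mode above is easy to reproduce. The function below is a simplified stand-in for the existing direct answer metrics (which additionally involve answer normalization and multi-annotator credit), showing how a correct paraphrase receives zero credit:

```python
def exact_match(prediction: str, gold_answers: list) -> bool:
    # Simplified stand-in for lexical VQA/QA metrics: credit only if
    # the normalized prediction exactly equals some gold answer.
    norm = lambda s: s.lower().strip()
    return norm(prediction) in {norm(g) for g in gold_answers}

gold = ["horseback riding", "riding horseback"]
exact_match("riding a horse", gold)    # -> False, despite being correct
exact_match("horseback riding", gold)  # -> True
```

An LLM-based judge like InstructGPT-eval instead asks the model whether the candidate answer means the same thing as any gold answer, which accepts such paraphrases.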

Metrics: Multiple Choice
We evaluate the multiple choice setting for A-OKVQA and ScienceQA. In this setting, a method is similarly given an image and question, but also a list of textual choices. The method is required to select one of those choices as its predicted answer. We evaluate this setting using the standard accuracy metric.

Selecting Modules in ViperGPT
In Surís et al. (2023), the choice of modules is different for each task. This contrasts with end-to-end models like BLIP-2, which are purported to be task-agnostic. To draw a more direct comparison, we evaluate ViperGPT's performance when given the full API (Surís et al., 2023, Appendix B) and the set of all corresponding modules. We refer to this as the "task-agnostic" setting (Table 1). We find that, in this case, the gain of ViperGPT over BLIP-2 is reduced from +6.2% to +2.1% on GQA and from +11.1% to -3.6% on OK-VQA (using the existing metrics). We continue to observe that our task-agnostic ViperGPT variant usually does not perform better than BLIP-2 across benchmarks and metrics, with the exception of the multiple choice setting of A-OKVQA, on which ViperGPT does outperform BLIP-2 significantly.
Since ViperGPT relies on BLIP-2 as one of its modules (i.e. in simple_query, for simple visual queries), we wonder how much influence BLIP-2 has in the ViperGPT framework. Moreover, how much does ViperGPT gain from having modules and functions in addition to BLIP-2?
Accordingly, we run two ablations: we evaluate the performance of ViperGPT without BLIP-2 and with only BLIP-2 (i.e. with no other modules).We also report these evaluations in Table 1.
To do so, we modify the full API prompt provided to ViperGPT. For "only BLIP-2", we delete all modules and functions in the prompt besides ImagePatch.simple_query. As the prompt for this module included in-context demonstrations relying on other (now removed) modules, we had to re-write these demonstrations. We either rewrite the existing problem ("zero-shot") or rewrite three random training set examples for each dataset ("few-shot"). For "without BLIP-2", we simply delete ImagePatch.simple_query and all references to it from the prompt. We show examples for both procedures in Appendix C.
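Mechanically, the "without BLIP-2" deletion amounts to stripping one method (and its call sites) from the API prompt. A sketch on a toy prompt, assuming each method is a contiguous indented block (our actual prompt edits were done by hand and also touch the in-context demonstrations):

```python
import re

def drop_method(api_prompt: str, method: str) -> str:
    """Remove a method's definition block (signature, docstring, body)
    from a class-style API prompt, plus any remaining call sites.
    Assumes one method per `def` block at a fixed 4-space indent."""
    # Delete the whole indented block starting at `def <method>(`,
    # up to the next method definition or the end of the prompt.
    block = re.compile(
        rf"^    def {method}\(.*?(?=^    def |\Z)", re.S | re.M)
    pruned = block.sub("", api_prompt)
    # Drop any leftover lines that still call the removed method.
    return "\n".join(l for l in pruned.splitlines()
                     if f".{method}(" not in l) + "\n"

toy_api = (
    "class ImagePatch:\n"
    "    def find(self, object_name):\n"
    "        ...\n"
    "    def simple_query(self, question):\n"
    "        ...\n"
    "    def verify_property(self, object_name, property):\n"
    "        ...\n"
)
print(drop_method(toy_api, "simple_query"))
```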
Because ViperGPT has no other image-to-text modules, we expect that excluding BLIP-2 (i.e. "without BLIP-2") should have a highly detrimental effect on the VQA performance of ViperGPT. However, we instead observe that this variant retains 84% and 87% of the average performance, respectively, for the direct answer and multiple choice benchmarks. This indicates that including many modules improves the robustness of the ViperGPT model, in that ViperGPT is able to compensate by using other modules to replace BLIP-2.
We find that using ViperGPT with BLIP-2 as the only module (i.e. "only BLIP-2") also retains significant performance in the direct answer setting: i.e. 95% on average. Moreover, this variant actually gains performance (+6% on A-OKVQA and +12% on ScienceQA) in the multiple choice setting. This result seems to indicate that the BLIP-2 module is doing most of the heavy lifting within ViperGPT for the VQA benchmarks.
Decomposing Problems with Programs or Prompting?
One of ViperGPT's major contributions is that it decomposes problems into Python programs: it inherently gains the compositionality and logical reasoning that is built into programming languages. However, recent work in NLP suggests that questions can also be iteratively decomposed, and solved more effectively than by end-to-end approaches, using step-by-step natural language prompting (Press et al., 2022). Here, we measure the gains related to ViperGPT's choice of building logical, executable programs in Python, rather than using the interface of natural language and reasoning implicitly within LLMs. We want to enable as direct a comparison as possible between natural language prompting with our method and program generation with ViperGPT. Thus, we choose the same VLM (BLIP-2) as ViperGPT and an LLM analogous to Codex (code-davinci-002); specifically, we use InstructGPT (text-davinci-002). We present the results of our method ("Successive Prompting") in Table 2 and Fig. 2 and directly compare against the "only BLIP-2" variants of ViperGPT. We have also used the same in-context examples for each dataset for both ViperGPT ("only BLIP-2, few-shot") and Successive Prompting, which helps keep the comparison more fair.
Our prompting method performs comparably to ViperGPT (i.e. retaining 92% of the performance on average) on GQA, OK-VQA, and A-OKVQA in the direct answer setting, and is noticeably better (i.e. +4% and +17%) in the multiple choice setting for A-OKVQA and ScienceQA. To the best of our knowledge, our method actually presents the highest A-OKVQA multiple choice score compared to any other training-free method.
Our method produces intermediate results only as natural language expressions, which are strictly subject to downstream operations by neural networks. ViperGPT, on the other hand, can produce Pythonic data types, like lists and numbers, as well as image regions. Unlike our prompting method, ViperGPT thus yields a more diverse set of intermediate representations, some of which can be symbolically manipulated, and is designed to leverage a diverse set of neural networks.
But from this experiment, we determine that it is not strictly necessary to decompose problems using programs in order to realize performance gains. Instead, natural language prompting can offer a simpler alternative. While ViperGPT leverages the intrinsic compositionality and logical execution of programming languages, our method uses conditional generation on intermediate results and a flexible natural language interface for reasoning, while remaining similarly effective.
In Appendix G, we tried to identify patterns in questions to determine whether they were more suitable for formal or natural language-based decomposition. We could not find any clear patterns, following the simple question type breakdowns of the original datasets, but are hopeful that future work will explore this further and reveal better insights.


How well does ViperGPT generalize to out-of-distribution tasks?
As we have observed in Sec. 4, ViperGPT has been designed around a specific set of tasks (including GQA and OK-VQA), especially in its selection of modules and prompt. On the other hand, a core motivation for modular and neuro-symbolic approaches is that they should have better generalization capabilities to unseen tasks. So, we further wonder how robust ViperGPT is to out-of-distribution tasks. In particular, we consider A-OKVQA and (especially) ScienceQA as out-of-distribution (compared to GQA and OK-VQA). First, we investigate changes to the prompt of ViperGPT. Will adding task-specific in-context examples improve the model's robustness to new tasks? In Table 1, we compare zero-shot and few-shot variants of "ViperGPT (only BLIP-2)". We can see that including few-shot examples consistently improves performance on the "in-domain" tasks (VQAv2, GQA, and OK-VQA) by +2% on average. But, this consistently hurts performance on the "out-of-distribution" tasks (A-OKVQA and ScienceQA): by 11% on A-OKVQA's direct answer setting and by 2% on average for their multiple choice settings.
We also look at the correctness of the programs generated by ViperGPT in Table 3. We find that the generated code is (on average) 3x as likely to encounter runtime errors for A-OKVQA compared to the other benchmarks in the direct answer setting. We find that this rate increases by another 3x (i.e. to 12%) for A-OKVQA in the multiple choice setting. And, ScienceQA holds the highest rate of runtime failures overall, at 18%. In the multiple choice setting, A-OKVQA and ScienceQA produce code that cannot be parsed (i.e. with syntax errors) 1% and 3% of the time, respectively. On the other hand, the rate of parsing exceptions is consistently 0% for benchmarks (including A-OKVQA) in the direct answer setting.
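The error taxonomy in Table 3 can be reproduced by first attempting to parse each generated program and then executing it. A minimal sketch (the real harness additionally supplies the ViperGPT modules and the input image to the execution namespace):

```python
def classify(program: str) -> str:
    """Classify a generated program as 'syntax error', 'runtime error',
    or 'no exception'. As in Table 3, 'no exception' indicates only
    completion of execution, not correctness of the answer."""
    try:
        # Parsing failures surface as SyntaxError at compile time.
        code = compile(program, "<generated>", "exec")
    except SyntaxError:
        return "syntax error"
    try:
        # Any exception raised during execution counts as a runtime error.
        exec(code, {})
    except Exception:
        return "runtime error"
    return "no exception"

classify("def f(:")          # -> "syntax error"
classify("x = 1 / 0")        # -> "runtime error"
classify("answer = 'two'")   # -> "no exception"
```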

Related Work
Visual Question Answering. Visual Question Answering is a common vision-language task with many variants: in our paper, we benchmark several mainstream VQA datasets (Goyal et al., 2017; Hudson and Manning, 2019; Marino et al., 2019; Schwenk et al., 2022; Lu et al., 2022) requiring a broad range of skills related to computer vision and perception, compositional understanding, outside and world knowledge, scientific and commonsense reasoning, and more.
End-to-end models. End-to-end models are the predominant approach in deep learning. In particular, we consider vision-language models here. We observe that recent, state-of-the-art vision-language models may rely on large-scale pretraining (Radford et al., 2021; Singh et al., 2022; Yu et al., 2022; Wang et al., 2023; Li et al., 2023), be trained on many tasks with a unified architecture (Chen et al., 2021b; Wang et al., 2022b; Lu et al., 2023), or both (Wang et al., 2022a; Alayrac et al., 2022; Chen et al., 2022). In this paper, we focus on the first category and find BLIP-2 (Li et al., 2023), which utilizes a frozen image encoder and language model, to be a suitable and effective representative.

Figure 2: Examples of decompositions for our "ViperGPT (only BLIP-2, few-shot)" and "Successive Prompting" models on our direct answer benchmarks. These examples have been condensed for readability. In Successive Prompting, follow-up questions (Q) are proposed by the LLM (InstructGPT) and answered (A) by BLIP-2. After a variable number of follow-ups, the LLM provides a prediction for the original question and terminates.
Neuro-symbolic methods. Several prior methods have attempted neuro-symbolic approaches for visual reasoning tasks (Andreas et al., 2015; Johnson et al., 2017; Hu et al., 2017; Yi et al., 2018). However, until now, these methods required training and found middling success in doing so. More recently, a few training-free approaches have been suggested that leverage the powerful in-context and program generation abilities of modern large language models (Gupta and Kembhavi, 2023; Surís et al., 2023; Subramanian et al., 2023). Simultaneously, these approaches adopt SOTA neural networks in their set of modules. Altogether, these methods present the first competitive neuro-symbolic solutions for visual reasoning tasks.
Natural language reasoning and tool-use. Recently, it has been found in NLP that large language models can more effectively perform reasoning tasks by reasoning step-by-step (Wei et al., 2022; Kojima et al., 2022). A few works have extended a similar capability in LLMs to complex problem and question decomposition (Press et al., 2022; Zhou et al., 2023; Dua et al., 2022; Khot et al., 2023). Finally, recent works have learned ways to prompt language models to call tools (Parisi et al., 2022; Schick et al., 2023). All of these emergent research directions jointly enable the prompting approach we present in Sec. 5.

Conclusion
In this paper, we have analyzed ViperGPT (Surís et al., 2023), a recent and intricate modular approach for vision-language tasks. We unbox ViperGPT, asking the research question: where does its performance come from? Observing that one of ViperGPT's five modules (i.e. BLIP-2 (Li et al., 2023)) often outperforms or constitutes a majority of its own performance, we investigate this module's role further. Through our experiments, we find that while ViperGPT's marginal gains over BLIP-2 are a direct result of its task-specific module selection, its modularity improves its overall robustness, and it can perform well even without the (seemingly critical) BLIP-2 module. Additionally, we investigate ViperGPT's choice of generating Python programs. We ask if this is necessary and, alternatively, propose a method relying on prompting large language models and vision-language models to instead decompose visual questions with natural language. Our method performs comparably to the relevant ViperGPT variants and, to the best of our knowledge, even reports the highest multiple choice accuracy on A-OKVQA (Schwenk et al., 2022) of any training-free method to date.

Limitations
Although the experiments in our paper are self-contained and designed to be directly comparable with each other, the absolute scores we report differ from other reports. For example, our results for BLIP-2 on GQA and OK-VQA are within 2-5% of the original reports. We attribute such differences to possible differences in model inference settings between Surís et al. (2023), which we follow (and which, e.g., runs the model with 8-bit inference), and Li et al. (2023).
In our "ViperGPT (only BLIP-2)" experiments, we find that the method often calls BLIP-2 directly in a single step and might not offer any mechanisms beyond BLIP-2 in that case.

A Existing Metrics vs. InstructGPT-eval
We observe that existing VQA metrics correlate poorly with (our) human judgments for evaluating model predictions. We analyze predictions for our model types (i.e. "BLIP-2", "ViperGPT (task-agnostic)", and "Successive" in Tables 1 and 2) on random training split subsets (N = 50) of VQAv2, GQA, and A-OKVQA. We specifically find that, when the existing and new (InstructGPT-eval) evaluation metrics disagree, InstructGPT-eval is correct 93% of the time. In Fig. 3, we show two examples of such disagreements between existing VQA metrics and the open-ended metric. Therefore, we include the results of existing metrics in our paper for posterity, but do not find them reliable (especially for open-ended text generated by models like InstructGPT). We instead make comparisons in our paper using the InstructGPT-eval metric. We observe that trends (i.e. which model performs better) are usually the same for both metrics, but the actual gaps may differ significantly.

B ViperGPT Design Choices
We make a few modifications (listed in Sec. 2.2) to the ViperGPT method (from the original design (Surís et al., 2023)) to improve conformity to the VQA task and ensure fairness when comparing to our prompting-based approach in Sec. 2.3. We elaborate further here.
As we always expect the executable program to return a string for VQA, we explicitly add "-> str" to the function signature in the prompt.By design, our prompting-based approach can similarly only result in a string.
For the multiple choice setting, we provide an explicit list of choices in the code-generation prompt (e.g. "# possible answers : ['dog', 'cat', 'foo', 'bar']" after the question prompt). We do this so the code generation model benefits from awareness of these choices when generating the program. Similarly, our prompting-based method is provided with a list of choices in conjunction with the question, prior to proposing follow-up questions.
Unlike the Successive Prompting method, the ViperGPT program is not guaranteed to produce a result that matches one of the multiple choices. So we map this result to the most similar choice using InstructGPT (text-davinci-003). We make this choice because text-davinci-003 is already used by a module in ViperGPT and for the open-ended evaluation metric.

C ViperGPT Variants
D Datasets

1. VQAv2: A longstanding benchmark whose questions require primitive computer vision skills (e.g. classification, counting). For example, "What color are the pants?" might entail (1) detecting the pants in the image and (2) determining their color.

2. GQA: This is a benchmark that focuses on compositional questions. It requires an "array of reasoning skills such as object and attribute recognition, transitive relation tracking, spatial reasoning, logical inference and comparisons" (Hudson and Manning, 2019). For example, "What color are the cups to the left of the tray on top of the table?" is a multi-step composition (focusing on spatial relationships and attribute recognition).
3. OK-VQA: Requires "outside knowledge" about many categories of objects. Usually requires detecting an object and asking for knowledge about that object. Example (Surís et al., 2023, Figure 5): "The real live version of this toy does what in the winter?". This involves locating and identifying the toy, then asking about the rest.

4. A-OKVQA: A follow-up benchmark to OK-VQA. Instead of asking for closed-domain knowledge about objects, this features "open-domain" questions that might also require some kind of commonsense, visual, or physical reasoning. For example, "Which position will the red jacket most likely finish in?" involves (1) identifying the context (a ski race), (2) locating all the racers, (3) identifying the racer who is wearing the red jacket, (4) determining the orientation of the race (e.g. left-to-right), and (5) determining the "index" of the red jacket racer among all racers along this orientation. The question's textual prior appears contextually insufficient, and proposing a program based on this alone (as in ViperGPT) could be fragile and quite difficult.
5. ScienceQA: This benchmark features scientific questions (of elementary through high school difficulty) that require both background knowledge and multiple steps of reasoning to solve. Their example question is "Which type of force from the baby's hand opens the cabinet door?" (choices: push, pull). The given reasoning is that (paraphrased) "The direction of push is away from and pull is towards the acting object. The baby's hand applies a force to the cabinet door that causes the door to open. The direction of the door opening is towards the baby, so the force is pull." It is not apparent without seeing the image, but what determines whether the baby is opening or closing the door is the fine-grained detail that the baby's hand is curled over the top of the cabinet door, not grasping the handle or pushing the door's surface. Current visual programming models are not capable of such difficult multi-step reasoning chains or of planning around such fine details. Instead, we find that ViperGPT tends to default to its end-to-end VQA module.

E Log likelihood of generating continuations
We use a weighted byte-length normalization when computing the log likelihood of a continuation, i.e.

$$\mathrm{score}(x) = \frac{\sum_{i=m+1}^{m+n} \log P(x_i \mid x_{<i})}{\sum_{i=m+1}^{m+n} L_{x_i}}$$

where x is a list of tokens (with m tokens in the prompt and n tokens in the continuation) and the byte-length of the token x_i is L_{x_i}.
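Under this normalization, the summed continuation log probability is divided by the continuation's total byte length. A sketch, assuming per-token log probabilities are available from the model:

```python
def normalized_loglik(tokens, logprobs, m):
    """Byte-length-normalized log likelihood of a continuation.
    `tokens` holds the m prompt tokens followed by the continuation
    tokens; `logprobs[i]` is log P(tokens[i] | tokens[:i]). The summed
    continuation log probability is divided by the continuation's
    total byte length (L_{x_i} summed over continuation tokens)."""
    continuation = tokens[m:]
    total_bytes = sum(len(t.encode("utf-8")) for t in continuation)
    return sum(logprobs[m:]) / total_bytes

# The same text split into fewer, longer tokens is not unfairly
# favored relative to a split into more, shorter tokens.
normalized_loglik(["Q: ", "ab"], [-0.5, -4.0], 1)  # -> -2.0
```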

F Runtime Failure Rate of ViperGPT
As an extension to Table 3, we further break down ViperGPT failures for runtime errors in Table 4.

H Successive Prompting Example
Question: Has the food this woman is preparing been fried?
Follow-up: What's in the image?
Follow-up answer: a person is preparing a salad on the counter
Follow-up: Has the lettuce been fried?
Follow-up answer: no
Answer to the original question: no

Figure 1: A diagram of the end-to-end, modular, and prompting-based models (Sec. 2) we explore in this paper. Each setting receives an image and question as input and produces a prediction as output (at which time it terminates). Similar colors across models in this diagram refer to the same modules.

Figure 3: Two examples of disagreements between existing VQA metrics and InstructGPT-eval from Kamalloo et al. (2023). In both, InstructGPT-eval correctly evaluates the predictions. As we mention in Sec. 3.2, we find evaluations from InstructGPT-eval more accurate in an overwhelming majority of cases.

Table 1: Our evaluation of BLIP-2 and our ViperGPT variants across VQA benchmarks. For each direct answer entry, we list both the existing metrics (left) and the InstructGPT-eval metric (right) described in Sec. 3.2. As explained in Appendix A, we include existing metrics only for posterity and make comparisons in our text using the InstructGPT-eval metric. We run BLIP-2 using the same inference settings as Surís et al. (2023), which differ slightly from Li et al. (2023).

Table 3: A breakdown of failure rates across exception modes for ViperGPT (task-agnostic) across our benchmarks. "No Exception" indicates only completion, not correctness, of executions. Parsing errors occur as SyntaxError in Python. We further break down runtime errors in Appendix F.

Table 4: Breakdown of failure rates across runtime exceptions for ViperGPT (task-agnostic) across our benchmarks.

Table 5: Breakdown of failure rates for our three model families by question type (as specified in the original GQA, OK-VQA, and ScienceQA datasets). We remove outliers (i.e. categories with < 50 samples) and sort by |ViperGPT − Successive|.