MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.


Introduction
Large Language Models (LLMs) including ChatGPT (OpenAI, 2022b), LLaMA (Touvron et al., 2023), PaLM2 (Anil et al., 2023), and the recently developed GPT4 (OpenAI, 2022c), have fundamentally transformed the manner in which humans interact with machines. Due to their vast knowledge repositories and chain-of-thought reasoning capability (Wei et al., 2023), these models have proven capable of answering a broad range of questions across domains without training on specific tasks. Nevertheless, these models face two significant challenges. First is the issue of hallucination (Li et al., 2023), attributable to the fact that LLMs store their knowledge in their parameters. Hallucination can seriously hamper the accuracy and reliability of question answering, as it can introduce plausible yet incorrect information. Second, while LLMs are designed to process the text modality only, there are numerous other non-textual sources of information, such as images, videos, audio, and tables, that could provide suitable answers to many real-world questions. Some queries may even require a synthesis of information across different modalities for accurate responses. Thus, the inability to process non-textual inputs restricts the effectiveness of current LLMs.
Consider the example outlined in fig. 1, where the question asked is, "Where was the movie 'The Shape of Water' made?". To effectively answer this question, a human would employ a divide-and-conquer approach. This strategy involves first retrieving relevant documents such as the movie poster, news reports, and table records for The Shape of Water. The individual would then derive the answer from the obtained references, which might include terms like Canada, Toronto, or London, Hertfordshire in England. Finally, reasoning would be applied to all potential answers, for instance, recognizing that Toronto is related to Canada. Given this relationship, and the lack of strong ties between London, Hertfordshire in England and the other candidate answers, Toronto is deemed the most probable answer and is selected as the final response.
However, many existing models rely on a joint strategy (Liu et al., 2023; Li et al., 2022; Yang et al., 2022; Chen et al., 2022) where they attempt to retrieve and rank all modalities by training a joint embedding space. This approach, despite its application, has several shortcomings. Firstly, the joint strategy lacks flexibility, necessitating retraining when new models and modalities are introduced. Secondly, it is considerably difficult to train a joint embedding space and rank references encompassing more than two modalities. Although some divide-and-conquer models have been proposed (Talmor et al., 2021), they also come with limitations. These models require training a question type classifier, followed by the training of various question-answering models. Each of these stages requires a significant amount of annotations, thus presenting a considerable challenge to their development and implementation.
To enable LLMs to solve this task in a zero-shot manner, we propose Multi-modal Open-domain Question Answering GPT (MOQAGPT). MOQAGPT utilizes a divide-and-conquer approach and employs robust models to extract answers from various modalities. It further leverages LLMs as a reasoning mechanism, applying in-context learning to process the extracted information and generate the final response. Compared with traditional supervised methods, our framework, as depicted in fig. 2, offers greater flexibility, trustworthiness, and interpretability. We corroborate these advantages by presenting experimental results on two multi-modal open-domain question answering (MMOQA) datasets: MMCoQA (Li et al., 2022) and MultiModalQA (Talmor et al., 2021). Both datasets require questions to be answered based on information retrieved from text, image, and table references. We conducted experiments using several of the latest models and demonstrated that our method is effective across all of them, highlighting the framework's flexibility. To demonstrate the trustworthiness of our method, we compared our outputs to those produced by directly querying LLMs; our outputs are less prone to hallucination, making them more trustworthy. Lastly, we examined several success and failure cases of our method. Thanks to the interpretable nature of our framework, we could identify the sources of errors. Overall, our method exhibits robust zero-shot performance on both datasets, underscoring its promising potential.
Our contributions in this paper are threefold: (1) We propose MOQAGPT, a simple and effective framework, which is the first to enable LLMs to tackle multi-modal open-domain queries in a zero-shot setting. (2) We conduct extensive experiments involving multiple LLMs and Vision-Language Models (VLMs), thus validating the effectiveness of our approach. (3) We present empirical evidence that LLMs are capable of efficiently addressing MMOQA tasks when paired with other modalities. Furthermore, we demonstrate that replacing each module with a superior version enhances performance, establishing this as a foundational framework for future zero-shot question-answering systems.

Related Work
Multi-modal Open-domain QA MMOQA represents a challenging yet realistic task that is crucial for future automated question-answering systems. This task necessitates the retrieval of pertinent references, after which answers are extracted from these references. This often involves complex processes such as modality selection and cross-modal reasoning. In light of this, several datasets have been introduced to benchmark the development of solutions in this area, such as ManyModalQA (Hannan et al., 2020), HYBRIDQA (Chen et al., 2020), WebQA (Chang et al., 2022), MultiModalQA (Talmor et al., 2021), and MMCoQA (Li et al., 2022).
Earlier works in this field focus on model training. These include methodologies for joint embedding of multiple modalities (e.g., MAE (Li et al., 2022) and ManyModalQA (Hannan et al., 2020)), the structured knowledge and unified retrieval-generation based method SKURG (Yang et al., 2022), and the Multimodal Graph Transformer (He and Wang, 2023), which employs a graph-based quasi-attention mechanism for integrating multi-modal graph information. To the best of our knowledge, we are the first to introduce a zero-shot method for multi-modal open-domain question answering, marking a significant contribution.

LLM-based Modular Systems
The development of modular neural networks, rooted in biology and neuroscience, can be traced back to the 1990s (Azam, 2000; Auda and Kamel, 1999). Before the rise of LLMs, modular neural networks (Andreas et al., 2015, 2016) aimed to handle compositional tasks. They did this by decomposing them into sub-tasks, utilizing off-the-shelf language parsers, and then learning specialized neural modules for each. However, their applicability was limited, being constrained by parser performance and the need for hand-specified module types.
The emergence of LLMs has renewed interest in this area. LLMs address the parsing challenges without necessitating additional training. Consequently, various LLM-based systems have been proposed to target an array of compositional reasoning challenges. Examples include: Toolformer (Schick et al., 2023), which trains language models to select tools; Visual ChatGPT (Wu et al., 2023), HuggingGPT (Shen et al., 2023), and Chameleon (Lu et al., 2023), all utilizing GPT to deduce tool sequences for response generation; and ViperGPT (Surís et al., 2023) and Visprog (Gupta and Kembhavi, 2022), leveraging Codex and GPT3 (Brown et al., 2020) respectively to produce Python programs for visual reasoning tasks. Yet, these methods do not address MMOQA. MMOQA inherently involves multiple steps, making it apt for a modular approach. Our framework, therefore, capitalizes on LLMs to integrate and reason about information retrieved from various modalities for MMOQA. Our approach is distinct in that it requires no training, setting it apart from prior works like Schick et al. (2023). It also differentiates itself from research such as Surís et al. (2023); Gupta and Kembhavi (2022); Shen et al. (2023) with its emphasis on retrieval-based question answering. Although Chameleon (Lu et al., 2023) supports both retrieval and question answering, it does not address MMOQA tasks, especially those needing cross-modal information integration and reasoning. Moreover, our method operates in a zero-shot setting, avoiding the need for intermediate programs, unlike Chameleon, which requires a few-shot Python intermediate program.

MOQAGPT
MOQAGPT presents a general approach to generate answers to queries using a multi-modal knowledge base collection C, which encompasses text C_txt, tables C_tab, and images C_img. The divide-and-conquer strategy is accomplished through a two-stage process.
First, the Multi-modal Question Answer Extraction stage (§3.1) extracts answer candidates from different modalities of the knowledge base independently and applies rule-based strategies to sift through these responses. Second, the Answer Infusion stage (§3.2) employs LLMs' reasoning abilities to integrate information across the various modalities and select the most plausible answer. It is noteworthy that MOQAGPT operates with existing models and requires no additional training, thus exhibiting zero-shot inference capabilities. A comprehensive depiction of this methodology is provided in fig. 3, and the prompts used are described in table 1. We do not perform prompt engineering due to the API cost incurred.
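Under the assumption that each retriever, QA model, and reasoner is an interchangeable black box, the two-stage flow can be sketched as follows; the function names are illustrative, not taken from our codebase:

```python
def moqa_pipeline(question, knowledge, retrievers, qa_models, reasoner):
    """Two-stage MoqaGPT-style flow: per-modality extraction, then infusion."""
    candidates = {}
    for modality, corpus in knowledge.items():  # e.g. "text", "table", "image"
        # Stage 1a: retrieve references for this modality independently.
        refs = retrievers[modality](question, corpus)
        # Stage 1b: extract one answer candidate per retrieved reference.
        candidates[modality] = [qa_models[modality](question, r) for r in refs]
    # Also query the LLM directly, without any retrieved reference.
    direct = qa_models["direct"](question, None)
    # Stage 2: fuse the per-modality candidates into a final answer.
    return reasoner(question, candidates, direct)
```

Because every component is frozen and called through the same interface, swapping in a stronger retriever or VLM requires no retraining, only replacing the corresponding callable.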

Multi-modal Question Answer Extraction
In this stage, queries are answered independently for each modality, and a strategy is employed to refine the responses. It is important to emphasize that every retrieval model and question-answering model within our framework is frozen and can be interchanged. We delve into the details of this stage in the subsequent sections.

Retrieval
Different modalities utilize pre-established, highly effective models for retrieval, eliminating the need to map all modalities into a joint embedding space.
(i) For Text Retrieval, we use the ANCE model (Xiong et al., 2020), which adopts a dense retrieval approach to encode both passages and queries; retrieval is then based on the cosine similarity between these encoded representations. (ii) For Image Retrieval, the renowned CLIP model (Radford et al., 2021) is employed. This zero-shot retriever, trained on web-crawled caption-image pairs using an image-text contrastive loss, has shown impressive performance on image-text retrieval benchmarks (Lin et al., 2014; Young et al., 2014). The similarity between queries and images is determined through their inner product. (iii) For Table Retrieval, tables are typically converted into a textual format and then encoded with language models (Herzig et al., 2020). Following this protocol, we employ OpenAI's Ada (OpenAI, 2022a), a robust embedding model, to encode linearized tables and queries, with similarity measured using the inner product. The retrieval process can be expressed as R = Retriever(q, C), where R represents the retrieved references, q is the question, and C is the knowledge collection.
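Each per-modality retrieval step reduces to nearest-neighbor search under a similarity function (cosine for ANCE, inner product for CLIP and Ada). A minimal sketch, with toy embedding vectors standing in for the real encoders:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def inner(u, v):
    """Inner-product similarity, as used for CLIP and Ada embeddings."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_emb, doc_embs, k=5, sim=cosine):
    """Return the indices of the k references most similar to the query."""
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: sim(query_emb, doc_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Because each modality is ranked independently, there is no need for a shared embedding space: the query is simply embedded once per retriever and scored against that retriever's own index.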

Question Answering
Upon retrieving the references, the next step is to extract an answer from each reference based on the question q. (i) For Visual QA, we use vision-language models (VLMs) with robust zero-shot capabilities to generate the responses. In this step, the question is fed into these VLMs using a simple prompt: 'Question: Q Answer:'. (ii) Since table data can be linearized into text, both Textual QA and Tabular QA can be tackled with a single LLM. As we are aiming for extractive question answering, the final response should be a concise text span from the provided input. We direct the LLMs with a specific prompt (refer to Prompt QA in table 1) for this purpose. Moreover, we found that some questions could be addressed directly by prompting LLMs, so we also query the LLMs directly (refer to Prompt Direct-QA in table 1). The prompts are as follows:

Prompt QA: You are performing extractive question answering. Given the document: {reference}, extract a short answer to the question: {Q} from the document. If insufficient information is available to answer the question, respond with 'Unknown'. The answer should be one or two words long.

Prompt Direct-QA: Question: {questions}. Please provide a concise response, limited to one or two words, no explanation and further question. Answer:

Prompt Answer-Fusion: Given question {Q}, please select the best answer from the following candidates: {Candidates}

Prompt Re-extract: Given the question {Q}, please extract the answer span from {final answer}, without providing additional sentences or explanations. The response should be a single word.

The overall process can be represented as A_i = QA(q, r_i), where r_i signifies the i-th reference from R.
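As a rough illustration of this stage, the Prompt QA template from table 1 can be filled in per reference, with tables first linearized into text. The pipe-separated linearization below is one plausible scheme, offered as an assumption rather than the exact format used:

```python
# Template text follows Prompt QA in table 1.
PROMPT_QA = (
    "You are performing extractive question answering. "
    "Given the document: {reference} , extract a short answer to the "
    "question: {question} from the document. If insufficient information "
    "is available to answer the question, respond with 'Unknown'. "
    "The answer should be one or two words long."
)

def build_qa_prompt(reference, question):
    """Fill the Prompt QA template for one retrieved reference."""
    return PROMPT_QA.format(reference=reference, question=question)

def linearize_table(header, rows):
    """Flatten a table into pipe-separated text an LLM can read as a passage.

    The exact separator is an illustrative choice, not the paper's scheme.
    """
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)
```

A linearized table can then be passed through `build_qa_prompt` exactly like a text passage, which is what lets a single LLM handle both Textual QA and Tabular QA.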

Rule-based Strategy
At this stage, we possess answer candidates derived from various modalities. Nonetheless, the autoregressive generation of responses by LLMs and VLMs can sometimes produce invalid outputs, such as "sorry, I can't". Through empirical observation, we identified that: (i) VLMs tend to produce an answer even when the relevant information is missing. (ii) The most accurate answer is not necessarily found in the top-1 (most similar) retrieved reference. (iii) Using our prompts, LLMs can discern when to provide a specific answer and when to default to "unknown", especially when the available information is insufficient.
With these insights, we crafted a task-agnostic, rule-based strategy to filter out invalid spans and prioritize the most likely answers generated by the LLMs and VLMs: (1) If the direct answer, A_direct, is found within any of the sets A_img, A_txt, A_tab, it is deemed reliable and chosen as the final answer. (2) Any answer containing phrases like "unknown" or "sorry" is discarded. (3) Rather than relying exclusively on the top-1 retrieved reference, we choose the most frequent response among the top-K retrieved references; if all responses are distinct, we opt for the response from the top-1 reference.
These rules are enforced in sequence. If rule 1 is not satisfied, we are left with a curated set of valid answer candidates, denoted as Ã = {Ã_img, Ã_txt, Ã_tab, Ã_direct}. This set is then used to pinpoint the final answer span by the reasoner, detailed in the subsequent section.
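The three rules might be implemented as follows; the invalid-marker list and the exact (case-insensitive) membership test for rule 1 are simplifying assumptions:

```python
from collections import Counter

# Markers that flag an invalid generation (assumed list, per rule 2).
INVALID_MARKERS = ("unknown", "sorry")

def apply_rules(direct, per_modality):
    """Apply the three filtering rules in sequence.

    per_modality maps a modality name to its top-K answers, most
    similar reference first. Returns (final_answer, None) if rule 1
    fires, else (None, filtered_candidates).
    """
    # Rule 1: trust the direct answer if any modality agrees with it.
    for answers in per_modality.values():
        if any(direct.lower() == a.lower() for a in answers):
            return direct, None
    filtered = {}
    for mod, answers in per_modality.items():
        # Rule 2: discard invalid spans such as "unknown" or "sorry, ...".
        valid = [a for a in answers
                 if not any(m in a.lower() for m in INVALID_MARKERS)]
        if not valid:
            continue
        # Rule 3: majority vote over top-K; fall back to the top-1 answer
        # when all responses are distinct.
        best, freq = Counter(valid).most_common(1)[0]
        filtered[mod] = best if freq > 1 else valid[0]
    return None, filtered
```

When rule 1 does not fire, the returned dictionary plays the role of the curated set Ã handed to the reasoner.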

Answer Infusion
For the majority of queries, even though it is hard to decide which modality contains the answer, the format of the answer is usually predetermined. For instance, a question like "What color is the Santa Anita Park logo?" should yield a color as the answer, not a date or a name. Inspired by this observation, we leverage LLMs to infer the correct answer format and select the appropriate answer from the candidates. To achieve this, we designed a prompt (refer to Prompt Answer-Fusion in table 1) that enables LLMs to determine the final answer. As the gold-standard answers are typically short, if the final answer contains more than three words, we guide the LLMs to select the correct text span using Prompt Re-extract.
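A sketch of this fusion step, using the Prompt Answer-Fusion and Prompt Re-extract templates from table 1 with a placeholder `llm` callable; the three-word threshold follows the text above:

```python
# Templates follow Prompt Answer-Fusion and Prompt Re-extract in table 1.
PROMPT_FUSION = ("Given question {q}, please select the best answer "
                 "from the following candidates: {cands}")
PROMPT_REEXTRACT = ("Given the question {q}, please extract the answer span "
                    "from {ans}, without providing additional sentences or "
                    "explanations. The response should be a single word.")

def fuse(question, candidates, llm):
    """Select a final answer from candidates; re-extract if it is too long."""
    answer = llm(PROMPT_FUSION.format(q=question, cands=", ".join(candidates)))
    if len(answer.split()) > 3:  # gold answers are short; trim to a span
        answer = llm(PROMPT_REEXTRACT.format(q=question, ans=answer))
    return answer
```

In the real system `llm` would be a ChatGPT or GPT4 API call; here it is just a function taking a prompt string and returning text.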

Experiments & Results
This section is organized as follows. In §4.1, we outline the MMOQA datasets, metrics, and baselines. We then discuss retrieval performance in §4.2, question answering in §4.3, and MMOQA results in §4.4. Lastly, we present an ablation study in §4.5, followed by a detailed case study in §4.6.

Implementation details
Dataset and Metrics. We evaluate our method on two MMOQA datasets (refer to table 2 for dataset statistics). Though they share the same references from tables, text, and images, they have different settings, questions, and answers. In all our experiments, we utilize only the top 5 references per modality from the knowledge collection. The MMCoQA dataset (Li et al., 2022) evaluates a model's proficiency in identifying the appropriate answer modality; each question is uniquely tied to a specific modality. While the dataset employs conversational structures with historical context, we use only the gold question for all test set experiments to emphasize a non-conversational approach.
The MultiModalQA dataset (Talmor et al., 2021) is designed for multi-modal comprehension and multi-hop reasoning QA. In alignment with prior studies (Li et al., 2022; Talmor et al., 2021; Yang et al., 2022), we test on the development set (the test set is unlabeled and no online evaluation is available), using Exact Match and F1 as evaluation metrics. The MultiModalQA dataset already provides 1-15 reference candidates per modality for each question.
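For reference, Exact Match and F1 follow the standard SQuAD-style token-level definition; the normalization details below (lowercasing, stripping punctuation and articles) are the conventional choice and may differ slightly from the datasets' official scoring scripts:

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(c for c in s if c not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)
```

Under these definitions, a prediction like "The Toronto" exactly matches the gold answer "toronto", while partial overlaps such as "Toronto" vs. "Toronto Canada" earn partial F1 credit.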
Baselines. To the best of our knowledge, our approach is the first to enable LLMs to perform zero-shot MMOQA. For zero-shot baselines, we select Direct QA by Vicuna, OpenChat, Llama2, ChatGPT, and GPT4 (see appendix B). Additionally, we benchmark our results against supervised methods: (i) For the MMCoQA dataset, we compare with the previous state-of-the-art models, MAE (Li et al., 2022), a joint embedding model trained on MMOQA datasets, and the ManyModelQA model (Hannan et al., 2020); these are the only models reported for this dataset. (ii) For the MultiModalQA dataset, we draw comparisons with the previous SOTA model SKURG (Yang et al., 2022), a structured knowledge and unified retrieval-generation based method, and the Multimodal Graph Transformer (He and Wang, 2023), a model that employs a graph-based quasi-attention mechanism to integrate multi-modal graph information.

Retrieval Results
For the MultiModalQA dataset, previous works directly use the gold reference set without any retrieval. We perform retrieval on the provided reference candidates (1-15 per modality) and find that Recall@5 is consistently 100%. As a result, we present retrieval results for MMCoQA only. Each question has a designated gold reference modality, and we group questions based on this attribute to report a breakdown of results across modalities in table 3. This evaluation focuses solely on the retrieval of candidates.
The table indicates that our divide-and-conquer approach provides significant benefits in effectiveness compared to the joint method. The prior state-of-the-art methodology, MAE, trains knowledge encoders for tables, images, and textual documents. Once trained, these knowledge encoders are frozen, and a contrastive loss function is employed to train the query encoder. This approach seeks to align the embedding spaces of tables, images, and textual documents via the query embedding, without incorporating an actual multi-modality alignment. In contrast, our methodology disentangles intricate multimodal knowledge relationships by retrieving each modality independently and then assembling the results using LLMs, eliminating the need for complex ranking. It registers a significant improvement, with its Recall@5x3 (51) being close to MAE's Recall@2000 (63.4).

Question Answering Results
We conduct question answering on the retrieved references, obtaining 5 answer candidates for each modality. The primary assessment criterion is whether the gold answers are present among these 5 candidates, so we use Recall@5 to evaluate the quality of the generated answers. Additionally, Vicuna's outputs often appear noisy, lengthy, and non-specific. For instance, for a question with the gold answer "Joss Whedon", Vicuna might produce a response like "1. Joss Whedon 2. David Greenwalt 3. David Boreanaz 4. Unknown", which complicates the extraction of the final answer, even if the recall score is high. This recall score represents the potential maximum performance, as the answers from different modalities could be harmonized through LLM reasoning.

Table 4: Question Answering Recall@5. We group questions based on gold reference modality to report a breakdown of results across modalities, as in table 3. Similarly, we group questions answerable by a single modality or those requiring multi-modality. The overall R@5x3 is calculated based on whether the gold answer is found within the concatenated 15 answer candidates.
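Recall@5 and the pooled Recall@5x3 can be computed as below; whether a hit requires exact string equality or substring containment is left open in the text, so normalized equality is used here as an assumption:

```python
def hit_at_k(gold, candidates, k=5):
    """1.0 if the normalized gold answer appears among the top-k candidates."""
    norm = lambda s: " ".join(s.lower().split())
    return float(any(norm(gold) == norm(c) for c in candidates[:k]))

def recall_at_5x3(gold, per_modality):
    """Pool the top-5 candidates from each of the 3 modalities (15 total)."""
    pooled = [c for answers in per_modality.values() for c in answers[:5]]
    return hit_at_k(gold, pooled, k=len(pooled))
```

Averaging `hit_at_k` over all questions of a given gold modality reproduces the per-modality Recall@5 breakdown; averaging `recall_at_5x3` gives the overall score.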
The results in table 4 indicate that VQA is the most challenging task in MMCoQA due to its low recall. In contrast, ChatGPT effectively manages textual question answering for both text and linearized tables. In the context of the MultiModalQA dataset, multi-hop reasoning or cross-modal understanding is frequently required, rendering tasks that involve multiple modalities more demanding than those relying on a single modality. Our empirical observations show that some questions can be addressed using references from different modalities, achieving an overall Recall@5x3 of nearly 50 across both datasets.

Multimodal Open-domain QA Results
Following the rule-based strategy, valid results are processed with the Prompt Answer-Fusion and subsequently reasoned over by the LLM. The results are shown in table 5 and table 6. The supervised methods on the MMCoQA dataset underperform, largely due to the subpar joint retrieval results highlighted in table 3. When MAE is provided with a gold reference, thereby eliminating the retrieval step, its performance notably improves, with a 30.64-point increase in F1 and a 24.58-point rise in EM. This implies that the main challenge lies in the retrieval and ranking of references. On the other hand, the Direct QA methods of LLMs effectively handle questions involving textual references, thanks to their expansive knowledge storage. Yet they falter on queries demanding image references, primarily due to modality limitations. Our zero-shot method outperforms the supervised baseline because of superior retrieval, question answering, and answer infusion capabilities. It also improves on the Direct QA approach when grounded in retrieved results, showing up to a 6.0 F1 and 5.9 EM boost over ChatGPT and a 5.0 F1 and 5.1 EM enhancement over GPT4. Overall, our methodology exhibits a significant improvement across all tested models.

The MultiModalQA dataset provides 1-15 references for each modality, diminishing the criticality of retrieval from an extensive multi-modal knowledge base. Thus, our divide-and-conquer approach might not realize its utmost potential here, and our zero-shot method trails the supervised baselines. This is in stark contrast to MMCoQA, where retrieval across modalities is imperative. Such foundational differences underscore the varied baseline results between the two datasets. However, given that our approach operates in a zero-shot fashion, devoid of task-specific annotations and specialized tools like question classifiers, MOQAGPT notably betters the zero-shot baselines and closes the performance gap with supervised methods.
As illustrated in table 6, in comparison to Direct QA, our method boosts the overall metrics by 6.0 F1 and 6.8 EM over ChatGPT, and 9.5 F1 and 10.1 EM over GPT4. The most significant leap comes from the Single Modality category, underscoring the efficacy of our approach for one-hop tasks. We also register improved scores in the Multi Modality category, showcasing the framework's ability to amalgamate different modalities. Predictably, GPT4, employed for direct QA and reasoning, exhibits greater gains than ChatGPT across both Single and Multi Modality categories. This aligns with our hypothesis: given the task's emphasis on cross-modal reasoning, our method leans heavily on robust reasoning capabilities to merge information across modalities, so the more adept the reasoner, the higher the performance. Moreover, GPTs capably filter out noise from Vicuna's output, markedly enhancing performance over direct QA by Vicuna on both datasets.
In conclusion, it is essential to underscore that real-world situations more closely mirror the MMCoQA setup, where evidence is not readily available but must be retrieved from vast repositories. In such scenarios, the strengths of our method shine through, substantially surpassing supervised methods and heralding broader acceptance and use.

Ablation study
To assess the efficiency of the proposed rule-based strategy, especially its efficacy in noise mitigation, we conduct experiments on MMCoQA. We utilize InstructBLIP for VQA, ChatGPT for Textual QA, and GPT-4 for Direct QA and Reasoning. Detailed findings from these experiments are presented in table 7.
Rule 1 proves to be essential, leading to a 23% reduction in GPT activations. This obviates the need for reasoning over potentially noisy answers, thereby enhancing response accuracy and curtailing inference time. The sensitivity of LLMs to input noise, as underscored in Zhang et al. (2023), reinforces the importance of Rule 2: excluding this rule introduces detrimental noise during the reasoning stage, adversely affecting the outcomes, as corroborated by the ablation studies. Rule 3, which refines response selection by assessing the consensus among top references, is likewise validated by the ablation study. Collectively, these findings establish our rule-based strategy as a pivotal, optimized element rather than a rudimentary heuristic.

Conclusion
In this study, we introduce the first zero-shot multi-modal open-domain question answering framework, MOQAGPT, which enables LLMs to perform the MMOQA task. The framework is flexible, accommodating new models and modalities without requiring additional training. It stands out for its trustworthiness, being grounded on reliable retrieval results, and its interpretability, which is ensured by transparent intermediate outputs. By leveraging LLMs and VLMs, our model surpasses supervised methods on MMCoQA and significantly narrows the gap between zero-shot and supervised methods on the multi-hop MultiModalQA dataset. Furthermore, our results indicate that models without robust knowledge storage capabilities, such as Vicuna, are less suited for this task. We hope that our approach offers useful insights and serves as a general and promising framework for multi-modal open-domain question answering.

Limitation
Central to our work is the dependency on Large Language Models (LLMs), particularly the GPT family, which, being proprietary, necessitates API calls, incurring both financial and temporal costs to replicate our results. While the datasets used in our studies incurred minimal costs ($2 and $5), larger datasets like WebQA could demand more. The consistent updates to the GPT versions imply that results, while not precisely reproducible, should only improve over those reported in this paper. Furthermore, we provide results from open-source LLMs, ensuring reproducibility.

A Does chain-of-thought help?
CoT reasoning represents an emerging capability of LLMs. Given that we employ LLMs as our Textual QA model and Reasoner, it is pertinent to examine whether CoT helps in our setup. To address this, we implement a straightforward strategy, adopting a prompt that reads: {reference} {question} Let's think step by step. This leads the model to generate a step-by-step reasoning response, which considers the reference in relation to the question and evaluates whether sufficient information is available for a response. Subsequently, we prompt GPT to extract an answer, considering the question, reasoning process, and reference, with the prompt: Reasoning:{reasoning} Question:{question} Give me a very short answer, in one or two words. Upon conducting this process, we observe the following. Firstly, CoT proves beneficial for retrieval-based question answering, as demonstrated in fig. 5. The technique enables LLMs to better extract potential answers from references, significantly improving recall for table, text, and overall data. However, for Direct QA, all metrics except GPT4's F1 score decrease. This is because CoT focuses on reasoning, which is not necessary for answering general knowledge questions. We noticed that with CoT, GPT4 tends to generate longer responses, thus improving its F1 score. Secondly, it is unexpected that the increased answer recall due to CoT does not aid final answer extraction (fig. 5). A possible reason is that CoT extracts a larger quantity of information from the reference material, both useful and irrelevant. It even attempts to provide answers when sufficient information is not available, adding not only correct answers but also incorrect ones to the candidate pool. For instance, in MMCoQA, the average number of valid answer candidates with CoT is 3, compared to 2.5 without it. This added noise can confuse GPT4, impairing its ability to make the correct decision.
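The two-step CoT procedure described above can be sketched with a placeholder `llm` callable standing in for the GPT API:

```python
# Templates follow the CoT prompts described in the text.
COT_PROMPT = "{reference} {question} Let's think step by step."
EXTRACT_PROMPT = ("Reasoning:{reasoning} Question:{question} "
                  "Give me a very short answer, in one or two words.")

def cot_qa(reference, question, llm):
    """Step 1: elicit step-by-step reasoning; step 2: extract a short answer."""
    reasoning = llm(COT_PROMPT.format(reference=reference, question=question))
    return llm(EXTRACT_PROMPT.format(reasoning=reasoning, question=question))
```

The cost of this variant is one extra LLM call per reference, which is why it is evaluated only as an ablation.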

C Detailed Example
Our methodology is detailed in table 9, table 10, and table 11, showcasing the retrieval results, question-answering process, strategy outputs, and final answer fusion. For these examples, BLIP2 serves as the VQA model, ChatGPT as the textual QA model, and GPT4 as the reasoner and direct QA model. The retrieval results across all modalities are reasonable, despite several 'Unknown' instances, which are filtered out through the strategy. While the final answers should be judged correct, they do not always meet the strict exact match criteria. The veracity of the results is established by grounded answer sources. For instance, in table 10, while GPT's answer is ungrounded and incorrect, our methodology provides accurate information enabling the LLM to select the correct choice.
Question: What is the competition of Gtu trttarfelag with match against Rangers? Gold reference modality: table; Answer: uefa champion leagu

Figure 1 :
Figure 1: An illustration of how humans adopt a divide-and-conquer strategy to answer a multimodal open-domain question.

Figure 2 :
Figure 2: Comparison of two paradigms for multimodal open-domain question answering. The fire symbol indicates modules that require training; the ice symbol indicates frozen models.
As depicted in fig. 2, MOQAGPT has three main advantages. Flexibility: MOQAGPT operates in zero-shot mode without relying on joint representation or inference, which allows easy replacement of individual modules with more advanced ones as they become available. Furthermore, it can accommodate a wider range of modalities and eliminates the need to curate a multimodal open-domain dataset for training. Trustworthiness: the framework's responses are grounded in the retrieved results, so each answer can be traced back to its original source, making the model more trustworthy and reducing the risk of hallucination. Interpretability: all intermediate outputs, including the retrieval results, candidate answers, and the final reasoning for answer synthesis, are produced in natural language, rendering the process of answering open-domain questions transparent and interpretable.

Figure 3 :
Figure 3: Overview of MOQAGPT. The snow symbol indicates that the model is frozen.

Fig 4
Fig 4 presents various instances of model performance. Questions 1-4 show failures: Question 1 shows the string matching metric failing to handle similar meanings. Question 2 illustrates the model's inability to choose the correct answer from the candidate list. Questions 3 and 4 highlight the proposal of incorrect candidates and the existence of a knowledge bias in the model, respectively. Conversely, Questions 5-8 exemplify successes: Question 5 shows hallucination errors being rectified via grounded retrieval-based answers.

Table 1 :
Prompts used in MOQAGPT

Table 3 :
Retrieval results for MMCoQA. † represents quoted results. Joint NDCG is computed over 2000 items. Questions are classified into categories based on the gold reference modality, and scores are computed for each modality independently. The overall result is computed on concatenated references from all modalities, which can be viewed as Recall@5x3.
Details are described in appendix B.

Table 5 :
Results on MMCoQA. † represents quoted results. VQA denotes models used to extract answers from images; Textual QA denotes models used to extract answers from text and linearized tables. Direct QA & Reasoner denotes the model used to answer questions directly and to fuse information, reason, and generate the final answer.

Table 7 :
Ablations on rule-based strategy

Table 8 :
CoT ablation results with different reasoners. The VQA model is InstructBLIP and the Textual QA model is ChatGPT.

Table 9 :
# Question Answering Prompt
You are performing extractive question answering. Given the document: {reference}, extract a short answer to the question: {question} from the document. If insufficient information is available to answer the question, respond with 'Unknown'. The answer should be one or two words long.
# Image Answers
wuppertaler, woman football, league cup, league cup, league cup
# Text Answers
Unknown., Unknown., Unknown., Unknown., Unknown.
# Table Answers
Unknown., Unknown., Unknown., Unknown., Unknown.
Example 1, detailed results of how MoqaGPT solves the task; note that repeated images exist in the dataset.