Retrieval Augmented Visual Question Answering with Outside Knowledge

Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation, introducing a potential limit on overall system performance. Instead, we propose a joint training scheme that integrates differentiable DPR with answer generation so that the system can be trained in an end-to-end fashion. Our experiments show that our scheme outperforms recent OK-VQA systems with strong DPR for retrieval. We also introduce new diagnostic metrics to analyze how retrieval and generation interact. The strong retrieval ability of our model significantly reduces the number of retrieved documents needed in training, yielding significant benefits in answer quality and the computation required for training.


Introduction
Visual Question Answering (VQA) is a challenging problem that lies at the intersection of Computer Vision, Natural Language Processing, and Information Retrieval. The objective in VQA is to read an image and provide an answer to an accompanying question about the image content. Current approaches to VQA employ deep-learning-based systems to jointly understand images and text.
VQA is particularly challenging when the answer to the question is not directly available in the image. In Knowledge-based VQA (KB-VQA), the VQA system must access external knowledge sources to find a correct and complete answer. The Outside-Knowledge VQA task (OK-VQA) (Marino et al., 2019) consists of questions that require general knowledge and simple inference to answer (Fig. 1). Such questions can be hard even for humans.
Unlike other KB-VQA datasets (e.g. FVQA (Wang et al., 2017)) which provide an associated knowledge base, OK-VQA encourages using any outside knowledge in answering questions.
Figure 1: OK-VQA contains questions whose answer cannot be found within the image. (Example question: "Which Sesame Street character would eat this?" Answer: "cookie monster".)
The need to adapt and refresh knowledge sources motivates the study of KB-VQA systems that can extract knowledge from both structured (e.g. ConceptNet (Speer et al., 2017)) and unstructured knowledge representations (e.g. Wikipedia passages). Recent designs (Luo et al., 2021; Gao et al., 2022) approach VQA in two distinct steps: (1) Knowledge Retrieval extracts documents from a large knowledge base; (2) Answer Generation produces an answer from these documents. Knowledge Retrieval can be done via Dense Passage Retrieval (DPR) (Karpukhin et al., 2020), which consists of a question encoder and a document encoder (both Transformer-based) that encode questions and documents into separate dense representations. The DPR system is trained to assign higher scores to documents intended to be helpful in answering questions, so that document sets can be retrieved and passed to Answer Generation.
Knowledge Retrieval based on DPR is powerful but has some readily observed limitations, particularly in model training. Firstly, whether a retrieved document is useful in answering a question cannot be easily determined, even if an answer is provided. Prior work (Qu et al., 2021; Luo et al., 2021) has addressed this problem using "Pseudo Relevance Labels", which are based on whether a document contains a given answer. However, these are only a weak signal of potential relevance and may encourage DPR to retrieve misleading documents. Secondly, the document retriever and answer generator are trained separately. To ensure that the answer generator sees relevant documents in training, systems can retrieve large numbers of documents (∼50+) (Gao et al., 2022; Gui et al., 2021), but at the cost of slower training and greater GPU usage, while also possibly presenting misleading material to the answer generator.
Joint training of the retriever and answer generator offers a solution to these problems. The aim is twofold: (1) to improve the retrieval of documents truly relevant to providing a given answer; and (2) to reject documents with pseudo relevance but not actual relevance.
Retrieval Augmented Generation (RAG) (Lewis et al., 2020) has shown that end-to-end joint training of a DPR-based QA system can outperform baseline two-step systems. A notable feature of RAG is a loss function that incorporates marginalized likelihoods over retrieved documents such that the training score of a document is increased whenever it improves prediction.
However, in preliminary OK-VQA experiments we found that RAG did not perform well. Our investigations found that a good portion of OK-VQA training questions are answerable in closed-book form (i.e. using pre-trained models such as T5 (Raffel et al., 2020)) with information extracted only from the image, with the unintended consequence that the RAG loss function awards credit to documents that did not actually contribute to answering a question. We also found that difficult questions that are unanswerable with the knowledge available to retrieval were more prevalent in OK-VQA than in the Open QA datasets (e.g. Natural Questions (Kwiatkowski et al., 2019)) on which RAG was developed. In both of these scenarios, the RAG loss function leads to counter-intuitive adjustments to the document scores used in training the retrieval model, leading to decreased VQA performance.
Motivated by these findings, we propose a novel neural-retrieval-in-the-loop framework for joint training of the retriever and the answer generator. We formulate a loss function that avoids sending misleading signals to the retrieval model in the presence of irrelevant documents. This formalism combines both pseudo relevance labels and model predictions to refine document scores in training. We find significantly better performance on OK-VQA compared to RAG. In this paper:
• We present a novel joint training framework, Retrieval Augmented Visual Question Answering (RA-VQA), for Knowledge Retrieval and Answer Generation that improves over RAG and two-step baseline systems based on DPR (Karpukhin et al., 2020).
• We investigate visually grounded features transformed into 'language space' and assess their contribution to OK-VQA performance.
• We study the role of document retrieval in KB-VQA and evaluate its interaction with retrieval-augmented generation. We also show that retrieval becomes more efficient in joint training, requiring retrieval of relatively few (∼5) documents in training.
Related Work

Knowledge-based VQA Systems. KB-VQA can access both structured data, such as ConceptNet and other KGs (Narasimhan et al., 2018; Garderes et al., 2020; Li et al., 2020a; Wu et al., 2021; Marino et al., 2021), as well as unstructured data such as Wikipedia passages (Wu et al., 2021; Gao et al., 2022; Gui et al., 2021). A variety of multimodal approaches have been explored to access external knowledge. ConceptBERT (Garderes et al., 2020) uses attention to aggregate graph node embeddings from ConceptNet. KRISP (Marino et al., 2021) uses a "symbolic knowledge module" to match ConceptNet KG entities with language/visual elements in questions. MAVEx (Wu et al., 2021) uses multiple information sources (Google Images, Wikipedia sentences, and ConceptNet) to validate promising answer candidates. VRR (Luo et al., 2021) uses Google Search in a retriever-reader pipeline to perform open-ended answer generation. We also note unpublished contemporaneous work on OK-VQA. TRiG (Gao et al., 2022) shows that it is feasible to transform images into textual features for VQA. The features used are similar to those presented here, although without an emphasis on the role of knowledge retrieval. PICa 'prompts' GPT-3 with descriptive captions generated from images, and KAT (Gui et al., 2021) exploits an ensemble of DPR, T5, and GPT-3 to improve OK-VQA performance.

Vision-to-Language Transformation
Prior work has established that images can be transformed into text such that large pre-trained language-based Transformers (e.g. BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and T5) can be applied to VQA tasks (Luo et al., 2021). Systems can be based on straightforward image captions, but we have found improvements by introducing additional visually grounded features. In RA-VQA, each image is represented by visual objects and their attributes, an image caption, and any text strings detected within the image. We use an object detection model, VinVL (Zhang et al., 2021), pre-trained on large object detection datasets, to extract visual elements and their attributes (e.g. color and material).
Formally, for an image $I$ we use VinVL to extract a set of visual objects $\{o_i\}$, along with a set of text attributes $\{a_{i,j}\}$ for each visual object. Visual objects and their attributes are extracted by VinVL at confidence thresholds of 0.8 and 0.6, respectively.
Image captioning is performed to extract relationships and interactions among visual elements such as "a woman holding a knife cuts a cake". The pre-trained captioning model Oscar+ (Zhang et al., 2021) is applied to process visual features extracted from the VinVL model to generate a caption for the image. To answer questions related to text strings in images (e.g. "which language is the book written in?"), Google OCR (Optical Character Recognition) APIs are used to extract text strings from each image.
Hence, a VQA training set $\{(I, q, S)\}$, where $S$ is a set of answers to a question $q$ about $I$, can be transformed into a text-only training set $T = \{(x, S)\}$ that we use for RA-VQA. The string $x$ contains all the text features extracted from the image (the question, the textual attributes for each identified visual object, the generated caption, and any OCR'd text), with special tokens marking the start and end of each type of feature (Fig. 2).
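To make the transformation concrete, the sketch below assembles the query string $x$ from the extracted features. It is illustrative only: the special-token names (<BOQ>, <BOV>, etc.) and the feature ordering are assumptions, since the text specifies only that special tokens delimit each feature type.

```python
def build_query_string(question, objects, caption, ocr_tokens):
    """Serialize the question and all visually grounded features into x (Sec. 3.1)."""
    # Each VinVL object contributes its attribute words followed by its class name.
    obj_text = " ".join(
        f"{' '.join(obj['attributes'])} {obj['class']}" for obj in objects
    )
    parts = [
        f"<BOQ> {question} <EOQ>",               # the question q
        f"<BOV> {obj_text} <EOV>",               # VinVL objects and attributes
        f"<BOC> {caption} <EOC>",                # Oscar+ image caption
        f"<BOT> {' '.join(ocr_tokens)} <EOT>",   # Google OCR text strings
    ]
    return " ".join(parts)

x = build_query_string(
    question="Which sesame street character would eat this?",
    objects=[{"class": "cookie", "attributes": ["round", "brown"]}],
    caption="a plate of chocolate chip cookies on a table",
    ocr_tokens=[],
)
```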

Weakly-supervised Dense Passage Retrieval
Dense Passage Retrieval in RA-VQA consists of a query encoder $F_q$ and a document encoder $F_d$, both Transformer-based encoders. The aim is to retrieve $K$ documents from an external knowledge base $Z = \{z_i\}_{i=1}^{N_d}$ (e.g. Wikipedia passages) that are expected to be useful for answering a question. DPR encodes questions and documents separately into dense feature vectors $F_q(x) \in \mathbb{R}^h$ and $F_d(z) \in \mathbb{R}^h$. Documents are retrieved for each question by scoring with the inner product between the representations of $x$ and $z$:

$$r(x, z) = F_q(x)^\top F_d(z) \quad (1)$$

RA-VQA training aims to maximize $r(x, z)$ when document $z$ is relevant to answering the question. As discussed in Sec. 1, the relevance between $q$ and $z$ cannot be easily obtained, and "pseudo relevance labels" serve as a proxy. We use a pseudo relevance function $H(z, S)$ which is 1 if $z$ contains an answer in $S$ (by string match), and 0 otherwise. For each question-answer pair $(x, S)$, one positive document $z^+(x)$ is extracted for training. In-batch negative sampling is used: all documents in a training batch other than $z^+(x)$ are considered negative for $(x, S)$ (Karpukhin et al., 2020). Denoting the negative documents as $N(x, S)$ and the score of the positive document as $r^+(x)$, the DPR loss $L_{DPR}$ is:

$$L_{DPR} = -\sum_{(x,S) \in T} \log \frac{\exp(r^+(x))}{\exp(r^+(x)) + \sum_{z \in N(x,S)} \exp(r(x, z))} \quad (2)$$

RA-VQA: Joint Training of Document Retrieval and Answer Generation

Given a full query string $x$ extracted from the image-question pair $(I, q)$, DPR returns the $K$ highest-scoring documents $\{z_k\}_{k=1}^{K}$. The score assigned by the document retriever $p_\theta(\cdot \mid x)$ to a retrieved document is

$$p_\theta(z_k \mid x) = \frac{\exp(r(x, z_k))}{\sum_{j=1}^{K} \exp(r(x, z_j))} \quad (3)$$

Open-ended answer generation for each retrieved document $z_k$ is performed with a generative model, such as T5, with parameters $\phi$:

$$p_\phi(y \mid x, z_k) \quad (4)$$

For each document $z_k$ retrieved for a training item $(x, S)$, we select the most popular human response $s^*_k$ from $S$ such that $s^*_k$ is contained in $z_k$; if $z_k$ does not contain any answer, the most popular answer $s^* \in S$ is selected, $s^*_k = s^*$. We identify two subsets of the retrieved documents $\{z_k\}_{k=1}^{K}$ based on pseudo relevance labels and model predictions: $P^+$ contains the indices of pseudo relevant documents that also lead the model to generate popular answers, whereas $P^-$ identifies documents not expected to benefit answer generation. Denoting the model's generated answer from $z_k$ as $\hat{s}_k$:

$$P^+ = \{k : H(z_k, S) = 1 \wedge \hat{s}_k \in S\}, \qquad P^- = \{1, \dots, K\} \setminus P^+ \quad (5)$$

Joint training of retrieval and answer generation is achieved with a loss $L_{RA\text{-}VQA}$ that reflects both model predictions and pseudo relevance:

$$L_{RA\text{-}VQA} = -\sum_{(x,S) \in T} \Big[ \sum_{k=1}^{K} \log p_\phi(s^*_k \mid x, z_k) + \sum_{k \in P^+} \log p_\theta(z_k \mid x) + \sum_{k \in P^-} \log\big(1 - p_\theta(z_k \mid x)\big) \Big] \quad (6)$$

The first term in the loss improves answer generation from queries and retrieved documents, taken together. The remaining terms affect document retrieval: the second term encourages retrieval of documents that are not only pseudo relevant but also lead to production of correct answers, while the third term works to remove irrelevant items from the top-ranked retrieved documents. Retrieval and generation complement each other in training: pseudo relevance labels and model predictions provide positive and negative signals to improve retrieval, and the improved retrieval leads to improved answer generation by training towards $s^*_k$, a customized generation target for each retrieved document $z_k$.
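The following PyTorch sketch illustrates how the loss of Equation 6 could be computed for a single training example. It is a minimal illustration under the reconstruction above, not the authors' implementation: tensor shapes, the clamping constant, and the treatment of $P^+$ and $P^-$ as complements are assumptions.

```python
import torch
import torch.nn.functional as F

def ra_vqa_loss(doc_scores, gen_log_probs, pseudo_relevant, answer_correct):
    """RA-VQA loss (Eq. 6) for one training example with K retrieved documents.

    doc_scores:      (K,) retrieval scores r(x, z_k)
    gen_log_probs:   (K,) log p_phi(s*_k | x, z_k), one customized target per document
    pseudo_relevant: (K,) bool, H(z_k, S) = 1
    answer_correct:  (K,) bool, the answer generated from z_k is a popular answer
    """
    log_p_theta = F.log_softmax(doc_scores, dim=0)   # log p_theta(z_k | x), Eq. 3
    p_theta = log_p_theta.exp()

    in_pos = pseudo_relevant & answer_correct        # indices in P+ (Eq. 5)
    in_neg = ~in_pos                                 # indices in P-

    gen_term = -gen_log_probs.sum()                  # answer-generation term
    pos_term = -log_p_theta[in_pos].sum()            # promote P+ documents
    # demote P- documents; clamp avoids log(0) when a document dominates
    neg_term = -(1.0 - p_theta[in_neg]).clamp_min(1e-8).log().sum()

    return gen_term + pos_term + neg_term

# Toy usage with K = 3 retrieved documents:
loss = ra_vqa_loss(
    doc_scores=torch.tensor([2.0, 0.5, -1.0]),
    gen_log_probs=torch.tensor([-0.2, -1.3, -2.4]),
    pseudo_relevant=torch.tensor([True, True, False]),
    answer_correct=torch.tensor([True, False, False]),
)
```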

RA-VQA Generation
Given an image query $(I, q)$, a full query $x$ is created (Sec. 3.1) and answer generation searches for the answer with the highest joint probability:

$$\hat{y} = \arg\max_{y,\, k}\; p_\theta(z_k \mid x)\, p_\phi(y \mid x, z_k) \quad (7)$$

Answers thus reflect both the generation and retrieval models, and retrieval confidence plays a strong role, unlike in some prior work such as Luo et al. (2021).
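A minimal sketch of this decoding rule, under the assumption that a candidate answer has already been generated from each retrieved document (e.g. by beam search):

```python
import torch

def ra_vqa_decode(retrieval_log_probs, candidates, candidate_log_probs):
    """Select the answer with the highest joint probability (Eq. 7).

    retrieval_log_probs: (K,) log p_theta(z_k | x)
    candidates:          list of K answer strings, one generated per document
    candidate_log_probs: (K,) log p_phi(y_k | x, z_k) for those answers
    """
    joint = retrieval_log_probs + candidate_log_probs  # log of the joint probability
    return candidates[int(joint.argmax())]

answer = ra_vqa_decode(
    torch.log(torch.tensor([0.7, 0.2, 0.1])),
    ["cookie monster", "elmo", "big bird"],
    torch.tensor([-0.1, -0.5, -2.0]),
)
```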

Pre-Computed FAISS Document Indices
Since repeated computation of embeddings for all documents is costly, we follow Lewis et al. (2020), who find that it is enough to train only the question encoder $F_q$ and leave the document encoder $F_d$ fixed. As shown in Fig. 2, document embeddings are pre-computed once with $F_d$ and indexed with FAISS for fast retrieval.
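A sketch of this setup with the FAISS library; the embedding dimensionality and corpus size below are placeholders, and real document embeddings would come from the frozen $F_d$ rather than random vectors:

```python
import faiss
import numpy as np

d, n_docs = 768, 100_000                                       # placeholder sizes
doc_embeddings = np.random.rand(n_docs, d).astype("float32")   # stand-in for F_d(z)

index = faiss.IndexFlatIP(d)   # inner-product search matches the scoring r(x, z)
index.add(doc_embeddings)      # built once, since F_d stays fixed during training

query = np.random.rand(1, d).astype("float32")   # F_q(x), recomputed as F_q trains
scores, doc_ids = index.search(query, 5)         # top-K document scores and indices
```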
Pre-training: We start with pre-trained versions of BERT-base and T5-large as the document retriever and the answer generator, respectively. The retriever is refined by training on GS-full under the DPR loss (Equation 2) with the pseudo relevance labels released by Luo et al. (2021). This already strong retriever serves as a good starting point for all DPR-based models presented in this paper (including RA-VQA and our replications of baselines from the literature).

OK-VQA Fine-tuning: Our RA-VQA framework trains the answer generator and the retriever jointly under Equation 6.
We also report on variants of RA-VQA to investigate the contribution of various model components to overall performance:
• RA-VQA-NoDPR omits retrieval entirely, so that answers are generated by the fine-tuned T5 alone and RA-VQA generation in Equation 7 simplifies to $\hat{y} = \arg\max_y p_\phi(y \mid x)$.
• RA-VQA-FrDPR leaves the retriever frozen after pre-training and fine-tunes the generator only.
• RA-VQA-NoPR trains document retrieval with model predictions only. The loss function is as in Equation 6, but with positive and negative document sets defined without the pseudo relevance condition: $P^+ = \{k : \hat{s}_k \in S\}$ and $P^- = \{1, \dots, K\} \setminus P^+$.
• RA-VQA-NoCT replaces the customized generation targets with the single most popular response ($s^*_k$ becomes $s^*$ in Equation 6), so that the generator is trained to produce the same answer from every retrieved document.

Evaluation
The following metrics are applied to assess the quality of individual answers generated and documents retrieved. Average scores are then computed over the evaluation set. The average of 3 runs with different seeds is reported.

Answer Evaluation
VQA Score: We follow Marino et al. (2019) to compute VQA Scores using pre-processed human annotations $S$:

$$\text{VQAScore}(y, S) = \min\Big(\frac{\#_S(y)}{3},\, 1\Big)$$

where $\#_S(y)$ is the number of annotators who answered $y$. This score ensures that a model is partially rewarded even if it generates one of the less popular answers from amongst the human responses. We also report Exact Match (EM), which credits an answer that matches any of the human responses in $S$.
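A direct implementation of this score; the divisor of 3 follows the standard VQA evaluation used by Marino et al. (2019):

```python
from collections import Counter

def vqa_score(prediction, human_answers):
    """VQA Score: min(#annotators who gave this answer / 3, 1)."""
    return min(Counter(human_answers)[prediction] / 3.0, 1.0)

# Two of ten annotators answered "cookie monster" -> partial credit of 2/3.
print(vqa_score("cookie monster", ["cookie monster"] * 2 + ["elmo"] * 8))
```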

Retrieval Evaluation
Following Luo et al. (2021), we use pseudo relevance to assess whether the retrieved documents are relevant to the response. Although this measures pseudo relevance rather than actual relevance, it remains a reasonable metric for retrieval evaluation.
Pseudo Relevance Recall (PRRecall)@K measures how likely it is that the K retrieved documents contain at least one positive document:

$$\text{PRRecall@}K = \mathbb{1}\Big[\max_{k=1,\dots,K} H(z_k, S) = 1\Big]$$

averaged over the evaluation set.
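A sketch of this metric for a single question; the case-insensitive substring match mirrors the pseudo relevance function $H$, though the exact matching rules (tokenization, answer normalization) are assumptions:

```python
def pr_recall_at_k(retrieved_docs, answers, k):
    """PRRecall@K for one question: 1 if any top-k document contains an answer."""
    return float(any(
        any(ans.lower() in doc.lower() for ans in answers)
        for doc in retrieved_docs[:k]
    ))
```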

Integrated System Evaluation
The above methods evaluate retrieval and answer generation as separate processes. We propose additional metrics that assess how the two processes behave in an integrated VQA system.
The Hit Success Ratio (HSR) counts questions that require external knowledge to answer: a question is credited when the generated answer is correct and is contained in the retrieved document from which it was generated,

$$\text{HSR} = \mathbb{1}\big[\, \hat{y} \in S \;\wedge\; H(\hat{z}, \{\hat{y}\}) = 1 \,\big]$$

averaged over the evaluation set, where $(\hat{z}, \hat{y})$ is the document-answer pair selected by Equation 7. HSR reflects the value of incorporating external documents into answer generation. By contrast, the Free Success Rate (FSR) counts questions that can be answered correctly without external knowledge, i.e. $\hat{y} \in S$ but $H(\hat{z}, \{\hat{y}\}) = 0$. A high FSR suggests a model can generate correct answers 'freely', without being distracted by retrieved documents when they are not needed.
We also assess performance as a function of the number of documents retrieved during training and testing, $K_{train}$ and $K_{test}$. In practice, $K_{train}$ has the greater effect on GPU usage, since a large $K_{train}$ requires at least $K_{train}$ forward passes for each question, and an Adam-like optimizer must compute and store the associated gradients (Kingma and Ba, 2015). In contrast, GPU memory required during testing is significantly lower, as no optimizer is involved. We are particularly interested in knowledge-augmented systems that can be robustly trained with small $K_{train}$ while yielding improved performance with large $K_{test}$.

Retrieval Augmented Generation
RAG (Lewis et al., 2020) is based on DPR and an answer generator that are trained jointly by approximately marginalizing the probability of $y$ over the retrieved documents. In the notation of Sec. 3:

$$p_{RAG}(y \mid x) = \sum_{k=1}^{K} p_\theta(z_k \mid x)\, p_\phi(y \mid x, z_k)$$

The answer generator and the retriever are jointly trained by optimizing the RAG loss:

$$L_{RAG} = -\sum_{(x,S) \in T} \log p_{RAG}(s^* \mid x)$$

The rationale is that $p_\theta(z_k \mid x)$ will increase if $z_k$ has a positive impact on answer generation (Lewis et al., 2020). We consider RAG an important baseline and have carefully replicated its published implementation.
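For contrast with the RA-VQA loss sketched earlier, a minimal sketch of the RAG objective for one training example. Note that the same target $s^*$ is scored under every document, which is what lets an irrelevant document receive credit whenever the generator answers correctly without it:

```python
import torch
import torch.nn.functional as F

def rag_loss(doc_scores, target_log_probs):
    """-log p_RAG(s* | x), marginalized over the K retrieved documents.

    doc_scores:       (K,) retrieval scores r(x, z_k)
    target_log_probs: (K,) log p_phi(s* | x, z_k), same target s* for all documents
    """
    log_p_theta = F.log_softmax(doc_scores, dim=0)
    # log sum_k p_theta(z_k | x) * p_phi(s* | x, z_k), computed stably in log space
    return -torch.logsumexp(log_p_theta + target_log_probs, dim=0)
```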

Baseline Systems in the Literature
We compare against the published OK-VQA results from systems described in Sec. 2: ConceptBERT, KRISP, MAVEx, and VRR. We also report performance against unpublished (non-peer-reviewed) systems TRiG, PICa, and KAT. TRiG uses a similar image-to-text transformation to this work, so to enable fair comparison with our model we replicate their knowledge-fusing method with our features. Baseline results are reported in Table 1; those marked * are our own. TRiG*, our implementation of TRiG, concatenates the K encoder outputs for the decoder to use in generation. We make some particular observations. Our TRiG* improves over the results released in its paper (VQA Score of 48.32 vs. 45.51) at $K_{train} = K_{test} = 5$; TRiG and TRiG Ensemble both benefit from more retrieved documents in training and testing ($K_{train} = K_{test} = 100$), although at great computational cost. Best performance with KAT-T5 and VRR similarly requires large document collections in training and in test.
We include results from GPT-3 based systems because they are amongst the best in the literature, but we note that GPT-3 is so much larger than T5 (175 billion parameters in GPT-3 vs. 770 million in T5-large) that simply switching a system implementation from T5 to GPT-3 can give significant improvements: KAT-T5 achieves a 44.25 VQA Score while ensembling it with GPT-3 yields 54.41, and GPT-3 alone already achieves good performance with prompting (PICa, 48.00 VQA Score). Our RA-VQA system is based on T5, but we still find competitive results even in comparison to systems incorporating GPT-3 (54.48 vs. 54.41 for KAT-Ensemble).

RA-VQA Performance Analysis
We find that RA-VQA matches or improves over all baseline systems with a VQA Score of 54.48. This is with a configuration of $K_{train} = 5$ and $K_{test} = 50$, validating our claim that RA-VQA can use a large number of retrieved documents in testing (50) while retrieving only a few in training (5). We find that reducing the number of retrieved documents in test ($K_{test} = 5$) reduces the VQA Score, but still yields performance better than all baselines except the KAT ensemble. We also find that RA-VQA performs well relative to GPT-3 baselines. RA-VQA yields a higher VQA Score than KAT-Knowledge-T5 (54.48 vs. 51.97) and matches the KAT-Ensemble system. We emphasize that RA-VQA is significantly smaller in terms of parameters (and model pre-training data) than these GPT-3 based systems, and that training RA-VQA requires much less memory ($K_{train} = 5$ vs. $K_{train} = 40$).

A detailed ablation study on input features is presented in Table 2. As shown, the T5 model fine-tuned on OK-VQA achieves a 28.05 VQA Score. The VQA Score increases to 46.16 as objects, object attributes, image captions, and OCR texts are incorporated into RA-VQA-NoDPR. With 5 retrieved documents, RA-VQA-FrDPR yields a 51.22 VQA Score, with a further improvement (53.81 VQA Score) from full joint training of retrieval and answer generation, confirming that outside knowledge is needed to answer OK-VQA questions.

Benefits of Integrated Training
Joint training is a key benefit of our proposed RA-VQA framework: model predictions combine with pseudo relevance labels to improve retrieval, and the resulting improved retrieval in turn provides customized answer generation targets. To quantify these effects, we take RA-VQA-FrDPR as a starting point (Table 1). Comparison with the other RA-VQA models suggests that DPR training is itself necessary, as using only pre-trained DPR (RA-VQA-FrDPR) leads to a weaker VQA Score (51.22). Using model predictions alone in joint DPR training (RA-VQA-NoPR) leads to a higher VQA Score (52.98 vs. 51.22), but a significantly lower PRRecall (77.67% vs. 81.25%): the model removes some pseudo relevant documents yet achieves better performance. This points to a potential problem. Pseudo relevance is only an imperfect indication of true relevance and so is not an ideal criterion on its own; training DPR to retrieve pseudo relevant documents could result in misleading documents being used in answer generation.
Using both pseudo relevance labels and model predictions in DPR training (RA-VQA) improves VQA Score to 53.18 and notably improves PRRecall to 82.84%. Including pseudo relevance ensures that potentially useful documents are retained, even while the generator is still learning to use them.
We also note that when generation targets are not customized for each retrieved document (RA-VQA-NoCT), VQA Score drops by 1.14 relative to RA-VQA, showing that customized generation targets play an important role in the overall system: by training the model to extract the reliable answers available in retrieved documents, answer generation and retrieval are both improved.

Interaction of Retrieval and Generation
Table 1 also reports our investigation into the interaction between document retrieval and answer generation. In comparing RA-VQA-FrDPR (frozen DPR) to RA-VQA, we see that joint training of DPR yields not only improved EM but also significantly higher HSR (+1.74%) and FSR (+1.21%). Manual inspection of OK-VQA reveals that there are many general knowledge questions. For example, document retrieval is not needed to answer the question "Is this television working?" in reference to a picture of a broken television lying in a field. A high FSR indicates good performance on such questions. By contrast, a high HSR reflects the ability to use document retrieval to answer the questions that truly require external documents.
Both EM and HSR are further improved for $K_{test} = 50$ in RA-VQA, with little change in FSR. The increased HSR-to-FSR ratio (0.41 vs. 0.40) indicates that RA-VQA is using these additional retrieved documents to answer the questions that need outside knowledge.
HSR and FSR also explain the relatively weak performance of RAG*. Although RAG* and RA-VQA-FrDPR have similar FSRs, RAG* has higher PRRecall but lower HSR (by 2.73%). This suggests RAG*'s DPR model is not well matched to its answer generator, so that retrieved documents remain unexploited. In manual examination of document-score gradients during training, we find anecdotally that adjustments to document scores are often counter-intuitive: documents that do not contain answers can still have their scores upvoted if the answer generator happens to find a correct answer by relying only on the abilities of the T5 model. This works against a model's ability to find answers in retrieved documents even when those documents are relevant.

As noted, retrieving a large collection of documents in training is costly (large $K_{train}$). Fig. 3 shows that RA-VQA can be trained with relatively few retrieved documents ($K_{train} = 5$). We gradually increase $K_{train}$ while fixing $K_{test} = K_{train}$ (dashed lines) and $K_{test} = 50$ (solid lines). RA-VQA achieves consistent performance (∼54.4 VQA Score) at $K_{test} = 50$, which suggests that our joint training scheme gathers the most useful knowledge into a top-50 list even when the model is trained to retrieve fewer documents. This is not the case for the frozen DPR systems, which require increasing $K_{train}$ to obtain best performance. RA-VQA's superior performance shows that joint training of retrieval and generation yields clear benefits in computation and answer quality. Additional analyses and case studies are in Appendices B-C.

Conclusion
Retrieval Augmented Visual Question Answering (RA-VQA) is a novel modelling framework for integrated training of DPR and answer generation. We have evaluated RA-VQA on the OK-VQA task and find significantly better performance than with independent training of the component systems. Through diagnostic metrics such as HSR and FSR, we analysed the interaction between retrieval and generation, and have shown how RA-VQA's gains arise relative to other approaches, such as RAG. As a further practical benefit, we found that RA-VQA can be used with larger numbers of retrieved documents than were used in system training, yielding computational savings without sacrificing performance.

Acknowledgement
W. Lin was supported by a Research Studentship funded by Toyota Motor Europe (RG92562(24020)). We thank our colleagues, Daniel Olmeda Reino and Jonas Ambeck, who provided insight and expertise in this project.
We thank Zhilin Wang (University of Washington) and Alexandru Coca (University of Cambridge) for comments that greatly improved the manuscript. We would also like to thank all the reviewers for their knowledgeable reviews.

Limitations
One possible limitation is that some relevant information (such as relative positioning of objects in the image) could be lost in transforming images independently of the information being sought. Extracting visual features based on queries could be a natural next step, although query-specific processing of the image collection would be computationally expensive.
We selected the Google Search corpus (Luo et al., 2021) as the knowledge base for our question answering system. Its advantages are that it is large, openly available, and can be readily used to replicate the results in this paper. However some visual question types (e.g. 'Is the athlete right or left handed?') could plausibly require both complex reasoning and more closely relevant documents from additional knowledge sources (such as Wikipedia). Our system may be limited with respect to these considerations.

B Supplementary Tables and Figures
Due to space limitations in the main paper, we present supplementary tables and figures in this section, offering additional findings.
B.1 Full Version of Table 1

We provide the full version of Table 1 in Table 4, together with further discussion of customized generation targets in joint training. As noted, RA-VQA improves retrieval with the feedback of model predictions, and in turn the improved retrieval leads to improved answer generation by training towards $s^*_k$, a customized generation target for each retrieved document $z_k$. We remove this interaction from RA-VQA models by enforcing $s^*_k = s^*$ (the most popular human response), independent of the retrieved $z_k$. The ablated models are denoted with the suffix -NoCT.
As shown in Table 4, customizing generation targets for each retrieved $z_k$ in training yields a performance boost for both RA-VQA-FrDPR and RA-VQA, showing that this supervision signal is beneficial to overall system performance. We also note that the improvement for RA-VQA (+1.14 VQA Score) is larger than for RA-VQA-FrDPR (+0.56 VQA Score), indicating that customized generation targets bring more benefit when retrieval is improved within our proposed joint training framework. This further confirms that the two components, retrieval and answer generation, complement each other bi-directionally.

B.2 Retrieval Performance of RA-VQA
In addition to Pseudo Relevance Recall (PRRecall) introduced in the paper, we further evaluate retrieval performance with Pseudo Relevance Precision (PRPrec)@K, calculated as the fraction of pseudo positive documents among the K documents retrieved for a question:

$$\text{PRPrec@}K = \frac{1}{K} \sum_{k=1}^{K} H(z_k, S)$$

where $H(\cdot)$ is the pseudo relevance function introduced in Sec. 3.2. The success of our RA-VQA model can be further explained by Table 3. As expected, RA-VQA-FrDPR (pre-trained DPR) achieves similar retrieval performance to VRR (Luo et al., 2021), since both are based on DPR and are trained with the same pseudo-relevance-based labels. Our proposed RA-VQA, with a substantial improvement in Recall over RA-VQA-FrDPR (82.84 vs. 81.25 PRRecall@5), achieves significantly higher Precision (57.39 vs. 51.82 PRPrec@5). This also yields substantial improvements in both EM (+3.05%) and VQA Score (+2.59%), suggesting that jointly training the retriever presents more potentially relevant documents to answer generation, improving the quality of the top-ranked documents.

PRRecall improves dramatically as $K_{test}$ increases from 5 to 50, after which only marginal improvement is observed. Similarly, the VQA Score of these models improves as more documents are presented in test, peaking at $K_{test} \sim 50$. This suggests that presenting additional documents in test makes it more likely that truly relevant documents are included, but also brings in more distracting and misleading documents.

B.3 Effects of Retrieving More Documents in Test
RA-VQA-NoPR (introduced in Sec. 4.1), which uses only model predictions in training to adjust document scores, without pseudo relevance labels, yields a significantly lower PRRecall curve (orange curve in Fig. 4(b)) than RA-VQA-FrDPR (blue curve) across all values of $K_{test}$, but achieves much higher VQA performance (Fig. 4(a)). This further confirms that pseudo relevance labels are a weak signal and that a high PRRecall does not guarantee that truly relevant knowledge is gathered in retrieval.

C Case Study
We present a case study in Fig. 5 comparing RA-VQA-FrDPR and our proposed RA-VQA framework. Conclusions for each case are given in the figure.
Figure 5: A case study comparing RA-VQA-FrDPR (baseline) and our RA-VQA, which benefits from joint training of retrieval and answer generation. The figure presents three cases:
• "How many teeth does this animal use to have?" — RA-VQA successfully retrieves more relevant documents than RA-VQA-FrDPR.
• "What position does the man with the bat play?" — Both models retrieve the same document ("catcher is a position for a baseball or softball player. when a batter takes their turn to hit, the catcher crouches behind home plate, in front of the (home) umpire, and receives the ball from the pitcher..."), but RA-VQA learns to extract the correct answer ("catcher") from it.
• "What is the active ingredient in this?" — RA-VQA-FrDPR retrieves misleading passages about food ingredients (e.g. recipe and cheese-making text) and fails, while RA-VQA retrieves a relevant passage ("active ingredient sodium fluoride 0.21% (0.12% w/v fluoride ion) purpose anticavity toothpaste...") and answers successfully. Questions about "ingredients" are common in the food domain, and pseudo relevance labels may present misleading material to the answer generator, leading to failed answers.