Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering

Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is widely used to evaluate knowledge-based VQA is OK-VQA, but it lacks a gold-standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because the knowledge bases vary, it is hard to compare models’ performance fairly. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on given knowledge. We introduce various ways to retrieve knowledge using text and images and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader’s performance on the OK-VQA challenge. The code and corpus are provided at https://github.com/luomancs/retriever_reader_for_okvqa.git.


Introduction
Knowledge-based VQA is a challenging task in which the knowledge present in an image is not sufficient to answer a question, so a method must seek external knowledge. Figure 1 shows two examples from the OK-VQA benchmark (Marino et al., 2019), which is commonly used to study knowledge-based VQA. In each of the two examples, external knowledge is needed to answer the question. For instance, in the first example, to identify the vehicle that uses the item shown in the image (top-left), a system needs to first ground the referred item as a fire hydrant and then seek the external knowledge presented at the top-right of the image. The challenge is to ground the referred object in the image and retrieve relevant knowledge in which the answer is present.
Although the OK-VQA benchmark encourages a VQA system to rely on external resources to answer the question, it does not provide a knowledge corpus for a QA system to use. As such, existing methods rely on different resources such as ConceptNet (Speer et al., 2017), WordNet (Miller, 1992), and Wikidata (Vrandečić and Krötzsch, 2014), resulting in the following issues:
1. It is difficult to fairly compare different VQA systems, as it is unclear whether a difference in performance arises from differing model architectures or from the different knowledge sources.
2. The different formats of the knowledge sources, such as the structured ConceptNet and the unstructured Wikipedia, demand different modules to retrieve knowledge, which makes a knowledge-based VQA system complicated.
3. External resources like ConceptNet and WordNet have limitations. First, they cover only a limited amount of knowledge. For example, ConceptNet provides only 34 relation types, and a vast amount of knowledge is hard to describe by a relation in a knowledge graph, such as a description of the logo of Apple Inc. Second, constructing a structured knowledge base requires heavy human annotation and is not available in every domain. This limits the application of a knowledge-based VQA system that relies on a structured knowledge base.
Therefore, there is a need for a general and easy-to-use knowledge base. Motivated by this, we collect a knowledge corpus for the OK-VQA benchmark. Our corpus is automatically collected via Google Search * by using the training-split questions and their corresponding answers, and we provide a training corpus with 112,724 knowledge sentences and a full testing corpus with 168,306 knowledge sentences. The knowledge corpus is in a uniform format, i.e., natural language, so it is easy for other OK-VQA methods to use. As we will show in §6, the knowledge base provides rich information to answer OK-VQA questions.
Utilizing the curated corpus, we further develop a weakly-supervised Visual-Retriever-Reader and evaluate it on the OK-VQA challenge. It consists of two stages, as seen in Figure 2. In the first stage, the visual retriever retrieves relevant knowledge from the corpus. In the second stage, the visual reader predicts an answer based on the given knowledge. Such a pipeline is well studied in text-only open-domain QA (Chen et al., 2017a). We apply its principles to the multi-modal vision and language domain with novel adaptations. On the retriever side, we introduce visual information and evaluate a cross-modality model and a text-only caption-driven model (§4.1). On the reader side, we build two visual readers, a classification type and an extraction type, both of which utilize visual information (§4.2). As we show in §6, our Visual-Retriever-Reader pipeline performs strongly on the OK-VQA challenge and establishes a new state-of-the-art.
Our experiments reveal multiple insights. First, we find that image captions are very useful for both the visual retriever and the visual reader, which demonstrates the value of image caption generators for knowledge-based VQA tasks. Second, a neural retriever performs much better than a term-based retriever. This observation is interesting because in the NLP domain a term-based retriever (e.g., TF-IDF and BM25) is typically a hard-to-beat baseline (Lee et al., 2019a; Ma et al., 2021), which suggests an essential role for neural retrievers in the vision-and-language domain. Third, similar to the NLP domain, where a reader can perform well if the given knowledge contains relevant information, we discover that our visual reader shows a large performance gap between noisy knowledge and high-quality knowledge. This motivates the need for developing a more efficient visual retriever for knowledge-based VQA tasks.
* https://developers.google.com/custom-search/v1/
Our contributions are threefold. First, we build a general, easy-to-use knowledge corpus for the OK-VQA benchmark, which makes model evaluation fair. Second, we propose a Visual-Retriever-Reader pipeline adapted from the NLP domain for the knowledge-based VQA task. Our model establishes a new state-of-the-art. Third, our experiments reveal the insights mentioned above and open a new research direction.

Related Work
Knowledge-based VQA. Many benchmarks have been proposed to facilitate research in knowledge-based VQA. FVQA (Wang et al., 2017a) is a fact-based VQA dataset that provides image-question-answer-supporting fact tuples, where the supporting fact is a structured triple, e.g., <Cat, CapableOf, ClimbingTrees>. The KB-VQA (Wang et al., 2017b) dataset consists of three types of questions: "Visual" questions answerable using the visual concepts in an image, "Common-sense" questions answerable by adults without looking up an external source, and "KB-knowledge" questions requiring higher-level knowledge, explicit reasoning, and external resources. KVQA (Shah et al., 2019) consists of questions requiring world knowledge of named entities in images. Specifically, the questions require multi-entity, multi-relation, multi-hop reasoning over Wikidata. KVQA is challenging because linking the named entities in an image to the knowledge base is hard at a large scale. OK-VQA (Marino et al., 2019) covers 11 types of knowledge, such as cooking and food, science and technology, and plants and animals, a broader range than previous tasks. VLQA (Sampat et al., 2020) consists of image-passage-question-answer data points and was proposed recently to facilitate research on jointly reasoning with both image and text.
Figure 2: Visual Retriever-Reader pipeline. Given an image and a question, the visual retriever first retrieves relevant knowledge, and the visual reader then predicts an answer based on the given knowledge. Example shown in the figure: Question: Where did this sport originate? Caption: a man riding a wave on a surfboard in the ocean. Knowledge Base: surfing was invented in hawaii. the exact person known for developing surfing is unknown, as the sport originated among the early polynesian peoples... Prediction: hawaii
OK-VQA Systems. Out of the Box (Narasimhan et al., 2018) uses Graph Convolutional Networks (Kipf and Welling, 2017) to reason on the knowledge graph (KG), where image and semantic embeddings are attached to each node. Mucko (Zhu et al., 2020) goes a step further, reasoning on visual, fact, and semantic graphs separately and using cross-modal networks to aggregate them. ConceptBert (Gardères et al., 2020) combines the pre-trained BERT model (Devlin et al., 2019) with a KG. It encodes the KG using a transformer with a BERT embedding query. KRISP (Marino et al., 2020) uses a BERT-pretrained transformer model for better semantic understanding and implicit knowledge, and reasons with a GCN model. Span-Selector (Jain et al., 2021) extracts spans from the question to search for the most relevant knowledge from Google, whereas MAVEx (Wu et al., 2021) votes among textual and visual knowledge from Wikipedia, ConceptNet, and Google Image. Besides knowledge collection, knowledge alignment (Shevchenko et al., 2021) also helps acquire a correct answer from knowledge.
Open-Domain Question Answering (ODQA) tasks target collecting information from a large corpus to answer a question. Advanced reading comprehension models (Chen et al., 2017a; Banerjee et al., 2019) split this complex task into two steps: a retriever selects the documents most relevant to a question from a corpus, and a reader produces an answer according to the documents from the retriever. Some previous work (Kratzwald and Feuerriegel, 2018; Lee et al., 2018; Das et al., 2019; Wang et al., 2018) trains end-to-end models to rerank in a closed set. Although these models are better at retrieval, they can hardly scale to larger corpora. Open-Retrieval Question Answering (ORQA) (Lee et al., 2019b) and Dense Passage Retriever (DPR) construct a dual-encoder architecture with the pre-trained BERT model. This dense retrieval approach performs better than classic TF-IDF or BM25-based ODQA models on several natural language benchmarks.

Knowledge Corpus Creation
The overall process of knowledge corpus creation (Figure 3) consists of the following four steps.
Step 1: Query Preparation Based on the assumption that the knowledge used for answering training set questions can also help in testing, the OK-VQA training questions are used with their answers to collect related knowledge from a search engine. We concatenate each question with each answer to get a "Question, Answer" pair. For example, in Figure 3, the question "What is the natural habitat of these animals?" has four answers, and each answer is attached to the question one by one to construct four queries.
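As a rough illustration of Step 1, the sketch below pairs a question with each of its answers to form search queries. The function name `build_queries`, the space-separated query format, the removal of repeated answers, and the example answers are all assumptions made for illustration; the text above does not specify them.

```python
def build_queries(question, answers):
    """Pair one question with each distinct answer to form search queries.

    The query format (question followed by the answer) is an assumption.
    """
    seen = set()
    queries = []
    for answer in answers:
        key = answer.strip().lower()
        if not key or key in seen:  # skip blanks and repeated annotations
            continue
        seen.add(key)
        queries.append(f"{question} {answer.strip()}")
    return queries


# Made-up answers for illustration; real answers come from the OK-VQA training set.
queries = build_queries(
    "What is the natural habitat of these animals?",
    ["forest", "woods", "forest", "jungle"],
)
```

Each resulting string would then be sent to the search engine in Step 2.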
Step 2: Google Search Webpage The generated queries are sent to Google Search API to obtain knowledge. As presented in Figure 3, a good search result web page contains a title, a link, and a snippet that consists of multiple complete or incomplete sentences and shows the most relevant part to the query. The top ten web pages with their snippets as the raw knowledge are chosen.
Step 3: Snippet Processing The snippets from Google search results consist of multiple sentences, some complete and some not. One option is to split snippets into multiple sentences, but our experimental results show that sentence-level knowledge is worse than snippet-level knowledge. Thus, we choose to use the snippet as the unit of knowledge. To address the incomplete-sentence issue, we find and grab the complete sentence present in the webpage. After this pre-processing, ten pieces of snippet knowledge from each "Question, Answer" query are selected.
Figure 3: The overall process of Knowledge Corpus Creation. Each question is first combined with its answers one by one to form queries, and each query is then sent to the Google Search API to retrieve the top 10 webpages. The knowledge is obtained from the snippet with further processing. Finally, we integrate the knowledge into the corpus. As shown in the search result page, the black boxes represent webpages, and red boxes represent snippets.
Step 4: Knowledge Processing We first remove the duplicated data among each "Question, Answer" pair. Then long knowledge (more than 300 words) or short knowledge (less than ten words) are removed. Pycld2 † is applied in this step to detect and remove the non-English part of each knowledge. Each knowledge is assigned a unique ID and duplicate knowledge sentences are removed. We curate in total 112,724 knowledge sentences for the OK-VQA training set.
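The length and duplicate filters of Step 4 can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `filter_knowledge`, the whitespace tokenization, and the case-insensitive duplicate check are assumptions, and the Pycld2 language-detection step is omitted.

```python
def filter_knowledge(snippets, min_words=10, max_words=300):
    """Drop duplicates and snippets outside the word-count range, then assign IDs."""
    corpus = {}
    seen = set()
    for text in snippets:
        normalized = " ".join(text.split())  # collapse whitespace
        n_words = len(normalized.split())
        if n_words < min_words or n_words > max_words:
            continue  # too short (< 10 words) or too long (> 300 words)
        key = normalized.lower()
        if key in seen:
            continue  # duplicate knowledge
        seen.add(key)
        corpus[len(corpus)] = normalized  # unique integer ID
    return corpus


snippets = [
    "surfing was invented in hawaii and the sport originated among early polynesian peoples",
    "Surfing was invented in Hawaii and the sport originated among early Polynesian peoples",
    "too short to keep",
    " ".join(["word"] * 301),
]
corpus = filter_knowledge(snippets)
```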

Visual Retriever-Reader Pipeline
We present our Visual Retriever-Reader pipeline for the OK-VQA challenge, where the visual retriever aims to retrieve relevant knowledge, and the visual reader aims to predict answers given knowledge sentences. This scheme has been widely used in NLP (Chen et al., 2017b). While previous work focuses on the pure text domain, we extend it to the visual domain with novel adaptations.

Retriever
We introduce two styles of visual retriever: term-based and neural-network-based. In the neural style, we further introduce two variants. Following convention, we use the standard terms in the descriptions below: for the term-based retriever we say documents, and for the neural retriever we say context; both refer to knowledge in our task.
† https://pypi.org/project/pycld2/
Term-based Retriever. In BM25 (Robertson and Zaragoza, 2009), each query and document is represented by a sparse vector in a d-dimensional space, where d is the vocabulary size. The score of a query and a document is then computed from term frequency and inverse document frequency. BM25 can only retrieve documents for a query in text format, but an image is part of the query in our task. To tackle this issue, we first generate image captions using a caption generation model. Then we concatenate the question and the caption as a query and obtain a list of documents by BM25.
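To make the caption-driven query concrete, here is a small self-contained sketch of Okapi BM25 scoring over a question-plus-caption query. The parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, and the two toy documents are assumptions for illustration; the paper does not state which BM25 implementation or settings were used.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # number of documents containing each term
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# The query is the question concatenated with the generated caption.
question = "where did this sport originate"
caption = "a man riding a wave on a surfboard in the ocean"
query = (question + " " + caption).split()
docs = [
    "surfing was invented in hawaii and the sport originated among polynesian peoples".split(),
    "a fire hydrant is connected to a water main".split(),
]
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top document
```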
Neural Retriever. Unlike BM25, neural retrievers extract dense representations for a query and a context from neural model(s). We use DPR as the neural retriever, which employs two BERT (Devlin et al., 2019) models to encode the query and the context respectively, and then applies the inner product to estimate the relevance between a query and a context. Similar to BM25, the DPR model considers the query in text format. To adapt DPR to the visual domain, we propose two methods. Image-DPR: we use LXMERT (Tan and Bansal, 2019) as the question encoder, which takes the image and question as input and outputs a cross-modal representation. Caption-DPR: similar to the strategy we use in the term-based retriever, we concatenate the question with the caption of an image as a query and use standard BERT as the query encoder to get the representation. In both Image-DPR and Caption-DPR, we use standard BERT as the context encoder. Figure 4 shows the architectures of standard DPR, Image-DPR, and Caption-DPR. To train the neural retriever, we use the inner product to get the similarity scores of relevant and irrelevant knowledge to a question, and we optimize the negative log-likelihood of the relevant knowledge.
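The training objective can be illustrated with a dependency-free sketch of the negative log-likelihood over inner-product similarities with in-batch negatives. The vectors below are hand-made placeholders standing in for encoder outputs, and the helper names `dot` and `in_batch_nll` are hypothetical; an actual implementation would compute this on BERT or LXMERT embeddings in a deep learning framework.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(query_vecs, context_vecs):
    """Mean negative log-likelihood; context i is the positive for query i,
    and every other context in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        sims = [dot(q, c) for c in context_vecs]
        m = max(sims)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]
    return total / len(query_vecs)


# Placeholder embeddings (assumed, not real encoder outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]   # positives match their queries
swapped = [[0.0, 2.0], [2.0, 0.0]]   # positives are mismatched
loss_aligned = in_batch_nll(queries, aligned)
loss_swapped = in_batch_nll(queries, swapped)
```

The loss is lower when each query is most similar to its own relevant knowledge, which is the behavior the objective rewards.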

Reader
Classification Reader (CReader). Current state-of-the-art VQA systems are classification models (Tan and Bansal, 2019; Li et al., 2019; Gokhale et al., 2020b,a; Banerjee et al., 2021), where a list of answer candidates is pre-defined (from the training set), i.e., a fixed answer vocabulary, and the model then classifies one of the answers as the final prediction. We build a reader in this style but incorporate external knowledge.
In particular, given a question, an image, and a piece of knowledge, we first concatenate the question with the knowledge and then apply a cross-modality model to encode the text with the image and generate a cross-modal representation. We feed this representation to a Multi-Layer Perceptron (MLP), which finally predicts one of the pre-defined answers. We apply the cross-entropy loss to optimize the model. In this work, we use LXMERT (Tan and Bansal, 2019), although other cross-modality models such as VisualBERT can be adapted.
Extraction Reader (EReader). The classification model fails to generalize to out-of-domain answers, i.e., questions whose answers are not in the pre-defined answer vocabulary. To tackle this issue, we use an extraction model adapted from machine reading comprehension models (Chen et al., 2017b). The model extracts a span (i.e., a start token and an end token) from the knowledge to answer the question. The image caption is given to the model as well to incorporate the image information. We also inject a special word "unanswerable" before the caption so that the model can predict "unanswerable" if the given knowledge cannot be relied on to answer the question. This strategy is helpful since the retrieved knowledge might be noisy. We use RoBERTa-large as the text encoder. Each token representation is then fed to two linear layers: one predicts a score for the token being the start token, and the other predicts a score for it being the end token. We apply the softmax function to get the probability of each token being the start or end token. The training objective is to maximize the probability of the ground-truth start and end tokens.
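The span prediction step can be sketched as below: softmax over start and end scores, followed by a search for the best span. Scoring a span by the product of its start and end probabilities and capping the span length with `max_len` are common choices assumed here, not details given in the text; the token scores are made-up numbers standing in for the outputs of the two linear layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end, probability) maximizing p_start * p_end with start <= end."""
    p_start, p_end = softmax(start_scores), softmax(end_scores)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best


# "unanswerable" is part of the input so it can be selected as a span.
tokens = ["unanswerable", "surfing", "was", "invented", "in", "hawaii"]
start_scores = [0.1, 0.2, 0.0, 0.1, 0.3, 4.0]  # assumed start-layer outputs
end_scores = [0.0, 0.1, 0.2, 0.0, 0.1, 5.0]    # assumed end-layer outputs
start, end, prob = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
```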

Training and Inference
Weak Supervision. The retriever is trained using weak supervision, as the ground-truth knowledge context is unknown for a given question-image pair. In particular, given a question and an image, we assume that knowledge containing any of the answers is relevant, and we use in-batch negative samples for training, i.e., at training time, any knowledge relevant to other questions in the same batch is considered irrelevant. For the reader, we use the same relevant knowledge as for the retriever, and when given such knowledge, the target is the answer. In addition, we use other collected knowledge that does not contain any answer as irrelevant knowledge; in that case, the reader should predict "unanswerable", a special word added to every piece of knowledge. The reader may also be considered to be trained with weak supervision, as the input knowledge is noisy, i.e., the assumed relevant knowledge is not guaranteed to be relevant.
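A minimal sketch of this weak labeling rule follows. Case-insensitive substring matching and the helper names `contains_any_answer` and `reader_target` are assumptions; the text only states that knowledge containing any answer is treated as relevant and that the reader's target is otherwise "unanswerable".

```python
def contains_any_answer(knowledge, answers):
    """Weak relevance label: True if the knowledge mentions any annotated answer."""
    text = knowledge.lower()
    return any(answer.lower() in text for answer in answers)


def reader_target(knowledge, answers):
    """Reader target: the first answer found in the knowledge, else 'unanswerable'."""
    text = knowledge.lower()
    for answer in answers:
        if answer.lower() in text:
            return answer
    return "unanswerable"


answers = ["hawaii", "polynesia"]
relevant = "surfing was invented in hawaii"
irrelevant = "a fire hydrant is connected to a water main"
labels = [contains_any_answer(k, answers) for k in (relevant, irrelevant)]
targets = [reader_target(k, answers) for k in (relevant, irrelevant)]
```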
Inference Strategy. We use the retriever to retrieve K knowledge (the value and effects of K will be presented in §7), and the reader predicts an answer based on each knowledge. We propose two strategies to predict the final answer. Highest-Score: the answer which has the highest score is the final prediction. Highest-Frequency: the answer which appears most frequently is the final prediction.
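The two strategies can be sketched as follows, assuming the reader returns one (answer, score) pair per retrieved piece of knowledge. The function names and the example predictions are hypothetical, and how "unanswerable" predictions are handled is not specified in the text, so they are not treated specially here.

```python
from collections import Counter


def highest_score(predictions):
    """Pick the answer with the single highest reader score."""
    return max(predictions, key=lambda p: p[1])[0]


def highest_frequency(predictions):
    """Pick the answer predicted most often across the K pieces of knowledge."""
    counts = Counter(answer for answer, _ in predictions)
    return counts.most_common(1)[0][0]


# One (answer, score) pair per retrieved piece of knowledge (made-up values).
predictions = [("hawaii", 0.55), ("california", 0.90), ("hawaii", 0.60)]
by_score = highest_score(predictions)
by_frequency = highest_frequency(predictions)
```

In this toy case the two strategies disagree, which is the situation in which the choice between them matters.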

Retriever Evaluation
We evaluate the performance of a retriever based on Precision and Recall. Because it is unknown which knowledge is relevant to a question-image pair, both metrics rest on the assumption that any retrieved knowledge containing any of the answers annotated in the OK-VQA dataset is relevant. Therefore, the computation of Precision and Recall in our case differs from the traditional definition, as described below.
Precision. Precision is the proportion of retrieved knowledge that contains any of the answers to a question-image pair. Mean Precision is the mean of Precision over all question-image pairs. Mathematically,
$$P(Q, I) = \frac{1}{K} \sum_{i=1}^{K} \min\Big(\sum_{j=1}^{M} \mathbb{1}(k_i, a_j),\ 1\Big),$$
where $k_i$ is the $i$-th of the $K$ pieces of knowledge retrieved for the question $Q$ and image $I$, $a_j$ is the $j$-th of the $M$ annotated answers, and $\mathbb{1}(k_i, a_j)$ is 1 if $k_i$ contains $a_j$ and 0 otherwise.
Recall. Recall indicates whether at least one piece of the retrieved knowledge contains any answer to a question-image pair. Mean Recall is the mean of Recall over all question-image pairs. Mathematically,
$$R(Q, I) = \min\Big(\sum_{i=1}^{K} \sum_{j=1}^{M} \mathbb{1}(k_i, a_j),\ 1\Big),$$
where the symbols have the same meaning as in Precision.
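These two metrics, as described in words above, can be computed for a single question-image pair with the short sketch below. Case-insensitive substring matching and the function name `precision_recall` are assumptions made for illustration.

```python
def precision_recall(retrieved, answers):
    """Per-question Precision and Recall under the answer-containment assumption."""
    hits = [any(a.lower() in k.lower() for a in answers) for k in retrieved]
    precision = sum(hits) / len(hits)     # share of retrieved knowledge with an answer
    recall = 1.0 if any(hits) else 0.0    # at least one piece contains an answer
    return precision, recall


retrieved = [
    "surfing was invented in hawaii",
    "a surfboard is a narrow plank",
    "waves form when wind blows over water",
    "wetsuits keep surfers warm",
]
precision, recall = precision_recall(retrieved, ["hawaii", "polynesia"])
```

Averaging these per-question values over all question-image pairs gives Mean Precision and Mean Recall.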

Answer Evaluation.
Original Evaluation In OK-VQA, each image-question pair has five answers annotated by humans. To apply an evaluation similar to VQA (Antol et al., 2015a), OK-VQA counts each answer twice so that each image-question pair has ten answers, the same as VQA. The score is computed as follows:
$$\mathrm{score}(A) = \min\Big(\frac{\#\,\text{humans that said } A}{3},\ 1\Big)$$
We use the above equation to compute the score of each answer for training and testing.

Open Domain Evaluation
Luo et al. (2021) propose an evaluation method for VQA that mitigates, in particular, the issues of Synonym/Hypernym and Singular/Plural. Here, we adapt their method to the open-domain setting, where Sentence Textual Entailment (STE) is used to find semantically similar answers. In STE, given a premise and a hypothesis, a score is generated to indicate whether the premise entails the hypothesis. In our case, a premise is a sentence that contains a gold answer, and a hypothesis is the same sentence with the gold answer replaced by the answer predicted by a model. If a high STE score ‡ is generated for such a pair of premise and hypothesis, it implies that the predicted answer is semantically close to the gold answer and thus deserves a partial score. We provide the detailed steps of the evaluation in Appendix A.
In the mathematical definition of the open-domain accuracy, $a$ is a predicted answer, $Ans$ is the set of ground-truth answers for a pair of question ($Q$) and image ($I$), $P_{a_j}$ is a set of sentences with a placeholder, $E(a_j, a, g_i)$ is the entailment score of a premise (sentence $g_i$ grounded by gold answer $a_j$) and a hypothesis (sentence $g_i$ grounded by predicted answer $a$), and $S(Q, I, a_j)$ is the ground-truth score of the $a_j$ that has the highest entailment score with the predicted answer $a$.
The main difference between our evaluation and that of Luo et al. (2021) is that they extend each answer with a set of alternative answers selected from the list of pre-defined answers in the training set. In our setting, we remove this step, since in the open-domain setting many semantically similar answers can be found in the open corpus and are not necessarily in the training set (see examples in Appendix B).
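For reference, the original OK-VQA score described under Original Evaluation (each of the five human answers counted twice, then min(count / 3, 1)) can be sketched as follows. The function name `vqa_score`, exact string matching, and the example annotations are assumptions for illustration.

```python
def vqa_score(prediction, human_answers):
    """Soft accuracy: min(number of humans that said the prediction / 3, 1)."""
    count = sum(1 for answer in human_answers if answer == prediction)
    return min(count / 3, 1.0)


# Five made-up annotations, each counted twice to give ten answers.
five_answers = ["hawaii", "hawaii", "polynesia", "hawaii", "pacific"]
ten_answers = five_answers * 2
score_hawaii = vqa_score("hawaii", ten_answers)        # 6 matches, capped at 1
score_polynesia = vqa_score("polynesia", ten_answers)  # 2 matches
score_missing = vqa_score("california", ten_answers)   # 0 matches
```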

Baselines
We use a state-of-the-art vision-language model, LXMERT (Tan and Bansal, 2019), as the baseline, and we also feed captioning and Optical Character Recognition (OCR) results for the OK-VQA dataset to the original LXMERT model.
LXMERT with OCR The OCR technique captures the textual content of an image and converts it into characters. Here we use the Google Vision API to extract text from images. After a noise reduction step that filters out all non-English words, we append the OCR results to the question and then send them to the LXMERT model. Our experiment shows that the OCR results help to address the OK-VQA task.
LXMERT with Captioning Similar to OCR, we also experiment with adding captions when training the LXMERT model. The captions are generated by the Oscar model (Li et al., 2020) and attached to each question before it is sent to the LXMERT model. Our results show that captioning improves the performance of the LXMERT model, and we therefore include LXMERT with captioning as a baseline as well.
‡ We only credit those predicted answers which obtain an STE score higher than 0.5.

Main Results
Table 1 shows that our best model, based on Caption-DPR and EReader, outperforms previous methods and establishes a new state-of-the-art result on the OK-VQA challenge. Interestingly, the LXMERT baseline without any knowledge achieves better performance than KRISP (Marino et al., 2020) and ConceptBert (Gardères et al., 2020), which leverage external knowledge. Incorporating OCR and captioning further improves the baseline accuracy by 1% and 1.6%, respectively.
Among the different variations of Visual Retriever-Reader, the best combination is Caption-DPR and EReader when the retrieved knowledge size is 80. We evaluate retrievers' performance based on Precision and Recall.

Analysis
Effects of the Quality of Knowledge. A common observation in open-domain question answering in NLP is that the reader can perform well if the given knowledge is good enough to answer a question. Here, we are interested in whether this also holds for our reader. Specifically, before we feed the retrieved knowledge to the reader, we remove knowledge that does not contain any answer and send only the remaining knowledge to the reader. The last row in Table 1 shows that our reader performs much better if the quality of the knowledge is good, suggesting that a more efficient cross-modality retriever is needed.
Effects of Size of Retrieved Knowledge and Prediction Strategy. The performance of the reader is directly affected by the size of the retrieved knowledge. A larger knowledge set is more likely to include the knowledge relevant to the question, but it also brings more distracting knowledge. In contrast, a small set might exclude relevant knowledge but contains less distracting knowledge. We use Caption-DPR to retrieve different numbers of pieces of knowledge and use EReader to predict an answer from each of these sets. We compare the effects on the two prediction strategies mentioned in §4.3. Figure 5 shows the comparison, and we make the following observations. First, when the knowledge size is small (less than or equal to 5), the Highest-Score strategy is better than Highest-Frequency; on the other hand, when the knowledge size is large, the Highest-Frequency strategy performs better than the Highest-Score strategy. Second, for the Highest-Score strategy, a size of 5 is the best, and increasing the knowledge size reduces the performance. Third, the Highest-Frequency strategy yields its best performance when the size equals 80. To summarize, if one uses a small set of knowledge, then Highest-Frequency negatively impacts the accuracy and the Highest-Score strategy is preferable. If one uses a larger set of knowledge, the Highest-Frequency strategy can achieve higher accuracy.
Effects of Completeness of Corpus. So far, when testing model performance, we have used the knowledge corpus collected only from training questions. However, if the training corpus does not include knowledge relevant to the testing questions, our model is under-evaluated because of the incompleteness of the knowledge corpus. To see fairly how our model performs when the knowledge corpus is complete, we use the same knowledge collection method described in §3 to collect knowledge for the testing questions. We then combine the training and testing knowledge into a complete corpus, which increases the corpus size from 112,724 to 168,306. We use Caption-DPR to retrieve knowledge from the complete corpus and ask EReader to predict answers based on these pieces of knowledge. Table 3 shows the increase in recall. As we expected, a complete corpus is helpful for Caption-DPR even though the corpus size increased, and it thus yields better EReader performance. Figure 6 compares the accuracy of EReader using knowledge retrieved from the two corpora. EReader consistently achieves higher performance using the knowledge retrieved from the complete corpus, where the biggest gain of 7.86% is achieved when using five pieces of knowledge. We further clean up the corpus following steps similar to those in Raffel et al. (2019), which removes 1% of the knowledge from the initial corpus. We provide the details in Appendix D.

Discussion
Training Corpus Bias One potential concern about our knowledge corpus is that the training corpus might be biased toward the training set, i.e., the training corpus includes knowledge for the training set and excludes knowledge for part of the testing set. To alleviate this concern, we analyze the training and testing sets in OK-VQA and find that 74.69% of testing answers overlap with the training answers, which indicates that a large portion of common knowledge is shared by the training and testing sets. To further complement the training corpus, we also provide the entire corpus for testing (discussed in §7), which also includes relevant knowledge for the testing set. The entire corpus is larger than the training corpus and includes previously unseen knowledge. Thus, such a testing corpus can evaluate the generalization ability of a retriever, which is an essential skill of any AI system (Mishra et al., 2021).
Extension. Although our pipeline is evaluated on the OK-VQA benchmark, it is generic and can be adapted for other knowledge-based question answering tasks such as FVQA (Wang et al., 2017a), KB-VQA (Wang et al., 2017b), and KVQA (Shah et al., 2019). For example, in KVQA, we can first collect a named-entity knowledge corpus with the proposed knowledge collection approach and then apply our Visual-Retriever-Reader pipeline. It should be noted that our proposed extraction reader addresses a more challenging problem, as classification models tend to learn correlations between output classes (answers) and the input image and question (Agarwal et al., 2020). In contrast, the extraction reader extracts answer spans, which we match exactly against the targets (answers).
Figure 6: EReader achieves significant improvement when using knowledge retrieved from the complete corpus compared to knowledge from the training corpus.

Conclusion
This paper collects an easy-to-use, free-form natural language knowledge corpus for VQA tasks with external knowledge. We also propose a weakly-supervised Visual Retriever-Reader pipeline, in which the retriever uses dense representations and the reader comes in two styles, classification and extraction. The pipeline has been evaluated on the OK-VQA challenge benchmark and has established a new state-of-the-art performance. Further analysis reveals that good knowledge from the retriever is vital for predicting the correct answer. In addition, both captioning and the neural retriever can significantly improve the QA system's performance.

Ethical Considerations
All the knowledge base data is collected fully automatically from website sources open to the public. We select several sentences from the passages on these pages for non-commercial use, which does not violate intellectual property or privacy rights. The whole dataset aims to provide an external knowledge source for knowledge-based QA, and thus each piece contains some commonsense or encyclopedic knowledge. To ensure that the sentences are properly excerpted from the source page, we also visit the source website to extract the complete sentences (see Step 3 of §3). Our experiment (§6) also shows that the collected knowledge is helpful for answering the OK-VQA questions.

References
Grounding In the grounding phase, we convert a question to a statement using the answers and predictions. Since a good prediction should have a semantic meaning similar to that of the answers, we assume that for one question, every answer and prediction plays the same role in the grounded statement, and thus we ground the question with a reserved position for any answer to fill in. For example, the original question "Who invented this device?" is grounded to "_ invented this device.", where "_" can be any of the answers to this question. An example of grounding is shown in Figure 8. To achieve this, a simple sentence role labeling step is applied to the questions to detect the different elements in the sentence (e.g., question word, object, subject, auxiliary word, etc.). After the roles of the elements are settled, the question is re-ordered to follow the word order of declarative sentences. We apply the above method to wh-questions and choice questions, which together cover 98.6% of the questions and 98.9% of the unique answers. Table 4 shows some examples of grounded sentences.
Assembling In the grounding step, the statements are gathered by question. We re-arrange these grounded statements by the provided answers for further processing. Figure 8 provides an example of this assembling step.
Entailment The grounded sentences are then sent to the Natural Language Inference (NLI) model. NLI is widely used in NLP tasks to check whether a hypothesis can be entailed from a given premise, and here we use NLI to check whether the correct answers and the predicted answer are semantically the same. To compare a provided answer with a predicted answer, we first list all grounded statements that use the provided answer as a correct answer. Then, for each of these statements, we fill the reserved position with the provided answer to form the premise and with our prediction to form the hypothesis, and we calculate the entailment score. We use the arithmetic mean of these scores as the final entailment score.
The threshold is set to be 0.5. We also skip the choice questions and the questions with numbers as answers, since, with only grounded statements provided, it is hard to tell whether the two numbers or two choices are similar. For each question with multiple answers, we pick the highest entailment score as the similarity score. Figure 9 shows the steps acquiring the entailment score and calculating the final score for a predicted answer.
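The aggregation described in this appendix (mean entailment score per provided answer, maximum over answers, threshold of 0.5) can be sketched as below. The function `entailment_similarity` and its data layout are hypothetical, and `toy_entail` is only a stand-in for a real NLI model so that the example runs on its own; it does not reflect how an actual entailment model scores sentences.

```python
def entailment_similarity(prediction, statements_by_answer, entail, threshold=0.5):
    """Return (closest provided answer, entailment score), or (None, 0.0)
    when no provided answer scores above the threshold."""
    best_answer, best_score = None, 0.0
    for gold, statements in statements_by_answer.items():
        scores = [
            entail(s.replace("_", gold), s.replace("_", prediction))  # premise, hypothesis
            for s in statements
        ]
        mean = sum(scores) / len(scores)  # arithmetic mean over the answer's statements
        if mean > best_score:
            best_answer, best_score = gold, mean
    if best_score <= threshold:
        return None, 0.0
    return best_answer, best_score


def toy_entail(premise, hypothesis):
    """Stand-in scorer (assumption): ignores a plural 's' on 'suit'."""
    normalize = lambda text: text.replace("suits", "suit")
    return 0.9 if normalize(premise) == normalize(hypothesis) else 0.1


statements_by_answer = {
    "wet suit": ["the man is wearing a _.", "a surfer wears a _."],
    "swimsuit": ["the man is wearing a _."],
}
matched, score = entailment_similarity("wet suits", statements_by_answer, toy_entail)
rejected = entailment_similarity("helmet", statements_by_answer, toy_entail)
```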

B Examples of Open Domain Evaluation
Here, we show four examples in which the predicted answer given by our model is in fact semantically close to one of the ground-truth answers. Such predictions get a score of 0 under the original accuracy evaluation but get a reasonable score under our proposed open-domain evaluation.

C Training Setup
Our neural retrievers (both Image-DPR and Caption-DPR) were trained on eight Nvidia RTX8000 GPUs, with 30 training epochs, a learning rate (lr) of 1e-5, a batch size (bs) of 64, and 4 gradient accumulation steps (gas). All the readers were trained on four GTX1080 and V100 NVIDIA GPUs. For CReader, we set the number of training epochs to 3, the lr to 2e-5, and the batch size to 16. For EReader, we set the number of training epochs to 3, the lr to 1e-5, the batch size to 4, and the gradient accumulation steps to 4.

D Dataset Cleaning
We cleaned our knowledge corpus following the steps in Section 2.2 of Raffel et al. (2019). Specifically, we removed knowledge that contains any word from the "List of Dirty, Naughty, Obscene or Otherwise Bad Words" ||. Knowledge that includes "JavaScript" or "lorem ipsum" is also removed. We also eliminate every piece of knowledge containing a curly bracket "{". These cleaning steps remove 1% of the knowledge from the corpus, bringing the clean training corpus size to 111,412. Similarly, 1% of the knowledge from the full corpus is removed, bringing the clean full corpus size to 166,390.
After obtaining the clean corpus, we apply our best visual retriever and visual reader to the OK-VQA challenge. Specifically, we first apply Caption-DPR to fetch 100 pieces of knowledge from the clean corpus and then ask EReader to predict the answer. When using the clean training corpus, the accuracy is 39.15, and with the clean full corpus it is 44.98.
|| https://github.com/LDNOOBW/List%2Dof%2DDirty%2DNaughty%2DObscene-and-Otherwise%2DBad%2DWords
Figure 7: This example calculates the entailment score of the provided answer "wet suit" and our prediction "wet suits". We first ground all questions into statements with a reserved position "_" for the answer. Then, we gather all the grounded statements by the provided answer. We replace the "_" with the provided answer and the predicted answer separately to form the premise and the hypothesis and get the entailment score. The entailment score of a provided answer and a prediction is calculated as the mean of all the entailment scores under that answer in the assembling list. We take 0.5 as the threshold and use the maximum as the final entailment score.
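The cleaning rules of this appendix can be sketched as a simple filter. The function name `is_clean`, the regular-expression tokenization, and the placeholder bad-word set are assumptions; the actual word list is the one referenced in the footnote and is not reproduced here.

```python
import re


def is_clean(text, bad_words):
    """Return True if the knowledge passes all three cleaning rules."""
    lowered = text.lower()
    if "javascript" in lowered or "lorem ipsum" in lowered:
        return False  # web boilerplate
    if "{" in text:
        return False  # likely code rather than natural language
    words = set(re.findall(r"[a-z]+", lowered))
    return words.isdisjoint(bad_words)  # no word from the blocklist


# Placeholder blocklist for illustration only.
bad_words = {"badword"}
corpus = [
    "surfing was invented in hawaii",
    "please enable JavaScript to view this page",
    "function f() { return 1; }",
    "this sentence contains a badword in it",
]
clean_corpus = [k for k in corpus if is_clean(k, bad_words)]
```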

Original Question | Grounded Statement
What is this type of blanket called? | this type of blanket is called _.
What is the name of the board he is on? | the name of the board he is on is _.
The food in the photo contains which healthy vitamins? | The food in the photo contains _ healthy vitamins.
Is this bathroom high or low end? | this bathroom is _.
Why is the cow going to the water? | the cow is going to the water because of _.
Table 4: Examples for some grounded sentences where the hypothesis gets score over the threshold.