Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering

Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate. Retrievals are frequently too general and fail to cover specific knowledge needed to answer the question. Also, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevancy. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, the currently largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models, and achieve a new state-of-the-art performance on OK-VQA.


Introduction
Passage retrieval under a multi-modal setting is a critical prerequisite for applications such as outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), which requires effectively utilizing knowledge external to the image. Recently, dense passage retrievers with deep semantic representations powered by large transformer models have shown superior performance to traditional sparse retrievers such as BM25 (Robertson and Zaragoza, 2009) and TF-IDF under both textual (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022) and multi-modal settings (Luo et al., 2021; Qu et al., 2021; Gui et al., 2021).
In this work, we investigate two main drawbacks of recent dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Luo et al., 2021; Qu et al., 2021; Gui et al., 2021), which are typically trained to produce similar representations for input queries and passages containing ground-truth answers.
First, as most retrieval models encode the query and passages as a whole, they fail to explicitly discover entities critical to answering the question (Chen et al., 2021). This frequently leads to retrieving overly general knowledge that lacks a specific focus. Ideally, a retrieval model should identify the critical entities for the query and then retrieve question-relevant knowledge specifically about them. For example, as shown in the top half of Figure 1, retrieval models should realize that the entities "turkey" and "teddy bear" are critical.
Second, on the supervision side, the positive signals are often passages containing the right answers with top sparse-retrieval scores such as BM25 (Robertson and Zaragoza, 2009) and TF-IDF. However, this criterion is inadequate to guarantee question relevancy, since good positive passages should reveal facts that actually support the correct answer using the critical entities depicted in the image. For example, as shown in the bottom of Figure 1, both passages mention the correct answer "vegetable", but only the second one, which focuses on the critical entity "bell pepper", is question-relevant.
In order to address these shortcomings, we propose an Entity-Focused Retrieval (EnFoRe) model that improves the quality of the positive passages for stronger supervision. EnFoRe automatically identifies critical entities for the question and then retrieves knowledge focused on them. We treat entities that improve a sparse retriever's performance when emphasized during retrieval as critical entities, and we use the top passages containing both critical entities and the correct answer as positive supervision. Our EnFoRe model then learns two scores: (1) the importance of each entity given the question and the image, and (2) how well each entity fits the context of each candidate passage.
We evaluate EnFoRe on OK-VQA (Marino et al., 2019), currently the largest knowledge-based VQA dataset. Our approach achieves state-of-the-art (SOTA) knowledge retrieval results, indicating the effectiveness of explicitly recognizing key entities during retrieval. We also combine this retrieved knowledge with SOTA OK-VQA models and achieve a new SOTA OK-VQA performance. Our code is available at https://github.com/jialinwu17/EnFoRe.git.

OK-VQA
Visual Question Answering (VQA) has witnessed remarkable progress over the past few years, in terms of both the scope of the questions (Antol et al., 2015; Hudson and Manning, 2019; Wang et al., 2018; Gurari et al., 2018; Singh et al., 2019) and the sophistication of the model designs (Antol et al., 2015; Lu et al., 2016; Anderson et al., 2018; Kim et al., 2018, 2020; Wu et al., 2019; Wu and Mooney, 2019; Jiang et al., 2018; Lu et al., 2019; Nguyen et al., 2021). There is a recent trend toward outside-knowledge visual question answering (OK-VQA) (Marino et al., 2019), where open-domain knowledge external to the image is necessary. Most OK-VQA models (Marino et al., 2019; Gardères et al., 2020; Zhu et al., 2020; Li et al., 2020; Narasimhan et al., 2018; Marino et al., 2021; Wu et al., 2022; Gui et al., 2021; Gao et al., 2022) incorporate a retriever-reader framework that first retrieves textual knowledge relevant to the question and image and then "reads" this text to predict the answer. As a free online encyclopedia, Wikipedia is often used as the knowledge source for OK-VQA. While most previous work has focused on the answer-prediction stage, performance is still lacking because of the imperfect quality of the retrieved knowledge. This work focuses on knowledge retrieval and aims at retrieving question-relevant knowledge that focuses explicitly on the critical entities for the visual question.

Passage Retrieval
Sparse Retrieval: Before the recent proliferation of transformer-based dense passage retrieval models (Karpukhin et al., 2020), previous work mainly explored sparse retrievers, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009), that measure the similarity between the search query and a candidate passage using weighted term matching. These sparse retrievers require no training signal on the relevancy of the passage and provide solid baseline performance. However, exact term matching prevents them from capturing synonyms and paraphrases and from understanding the semantic meanings of the query and the passages.

Dense Retrieval: To better represent semantics, dense retrievers (Karpukhin et al., 2020; Chen et al., 2021; Lewis et al., 2022; Lee et al., 2021) extract deep representations of the query and the candidate passages using large pretrained transformer models. Most dense retrievers are trained with a contrastive objective that encourages the representation of the query to be more similar to relevant passages than to irrelevant ones. During training, a passage with a high sparse-retrieval score that contains the answer is often regarded as a positive sample for the question-answering task. However, such positive passages may not fit the question's context and thus serve only as very weak supervision. Moreover, the query and passages are often encoded as single vectors, so most dense retrievers fail to explicitly discover and utilize entities critical to the question (Chen et al., 2021). This often leads to overly general knowledge without a specific focus.

Dense Passage Retrieval for VQA
Motivated by the trend toward dense retrievers, previous work has also applied them to OK-VQA. Qu et al. (2021) and Gao et al. (2022) utilize Wikipedia as a knowledge source, while Luo et al. (2021) crawl Google search results on the training set. However, the weak training signals for passage retrieval become more problematic for VQA, as the visual context of the question makes it more complex; a "positive passage" becomes less likely to fit the visual context and actually provide suitable supervision. In order to better incorporate visual content, Gui et al. (2021) adopt an image-based knowledge retriever that employs the CLIP model (Radford et al., 2021), pretrained on large-scale multi-modal pairs, as the backbone. However, question relevancy is not considered, so the retriever has to retrieve knowledge on every aspect of the image to cover different possible questions.
This work proposes an Entity-Focused Retrieval (EnFoRe) model that recognizes key entities for the visual question and retrieves question-relevant knowledge specifically focused on them.Our approach also benefits from stronger passage-retrieval supervision with the help of those key entities.

Phrase-Based Dense Passage Retrieval
The most relevant work to ours is phrase-based dense passage retrieval. Chen et al. (2021) employ a separate lexical model trained to mimic the performance of a sparse retriever that is better at matching phrases. Lee et al. (2021) propose the DensePhrases model, which extracts a feature for each possible phrase in the passage and uses only the most relevant phrase to measure the similarity between the query and passage. However, the training signals still come from exactly matching ground-truth answers, and the phrases are parsed from the candidate passage, limiting the scope of the search. In contrast, our approach collects entities from many aspects of the question and image, including object recognition, attribute detection, OCR, brand tagging, and captioning, building a rich unified intermediate representation.

Entity Set Construction
Our EnFoRe model is empowered by a comprehensive set of extracted entities, which are not limited to phrases from the question and passages as in Lee et al. (2021). We collect entities from the sources below. Most entity-extraction steps are independent and can execute in parallel, except for answering sub-questions, which first requires parsing the questions. Parallelizing these steps can significantly reduce run time.

Question-Based Entities
Entities from Questions: First, the noun phrases in questions usually reveal critical entities. Following Wu et al. (2022), we parse the question using a constituency parser (Gardner et al., 2018) and extract noun phrases at the leaves of the parse tree. Then, we link each phrase to the image and extract the referred object with its attributes, using a pretrained ViLBERT model (Lu et al., 2020) as the object linker.

Entities from Sub-Questions: OK-VQA often requires systems to solve visual reference problems as well as comprehend relevant outside knowledge. Therefore, we employ a general VQA model to find answers to the visual aspects of the question. In particular, we collect a set of sub-questions by appending each noun phrase in the parse tree to the common question phrases "What is ..." and "How is ...". When the confidence for an answer from a pretrained ViLBERT model (Lu et al., 2020) exceeds 0.5, the answer is added to the entity set. For the example in Fig. 2, the noun phrases "plush toy" and "president" generate the sub-questions "What is plush toy?", "How is plush toy?", "What is president?", and "How is president?". The answer confidence for "teddy bear" exceeds 0.5 for the first question, so we include it in the entity set.

Entities from Answer Candidates: Standard state-of-the-art VQA models are surprisingly effective at generating a small set of promising answer candidates for OK-VQA (Wu et al., 2020, 2022). Therefore, we finetune a ViLBERT model (Lu et al., 2019) on the OK-VQA dataset, extract the top 5 answer candidates, and add them to the entity set.

Image-Based Entities
Question-based entities have high precision and narrow down the search space for knowledge retrievers. To complement this, we also collect image-based entities to help achieve higher recall.

Entities from Azure tagging: Following Yang et al. (2022), we use Azure OCR and brand tagging to annotate the objects detected in the images by a Mask R-CNN detector (He et al., 2017).

Entities from Wikidata: As suggested by Gui et al. (2021), common image and object tags can be generic, with a limited vocabulary, leading to noisy or irrelevant knowledge. Therefore, we also leverage a recent visual-semantic matching approach, CLIP (Radford et al., 2021), to extract image-relevant entities from Wikidata. In particular, the entities with their descriptions in Wikidata and sliding windows over the images are used as inputs. Then, at most 18 entities with the top maximum CLIP scores over these sliding windows are preserved. We follow the released code for KAT (Gui et al., 2021) and resize the image so that the shorter edge is 384 pixels. The sliding-window size is set to 256 with a stride of 128.

Entities from Captions: Captions provide a natural source of salient objects in the image and do not suffer from the limited vocabulary of object detectors (Wu et al., 2018). Similar to extracting entities from the question, we parse the captions and extract noun phrases from the parse tree. During training, we use the human captions provided by the COCO dataset to obtain richer entities; during testing, we use captions generated by the OFA captioning model (Wang et al., 2022).
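The sliding-window cropping described above can be sketched as follows. This is a minimal illustration of the window geometry only; the CLIP scoring itself is omitted, and the function name is ours, not from the released code:

```python
def sliding_windows(width, height, win=256, stride=128):
    """Return (x1, y1, x2, y2) crop boxes for sliding windows over an
    image that has already been resized so its shorter edge is 384."""
    xs = range(0, max(width - win, 0) + 1, stride)
    ys = range(0, max(height - win, 0) + 1, stride)
    return [(x, y, x + win, y + win) for y in ys for x in xs]
```

For a 512x384 image this yields a 3x2 grid of 256-pixel crops; each Wikidata entity would then be scored by its maximum CLIP similarity over these crops.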

Oracle Critical Entity Detection
Given the comprehensive set of entities E covering different aspects of the question and image, we introduce an approach to automatically find critical entities and passages containing them. Those entities and passages are then used during training to provide stronger supervision. The intuition is that a good passage that fits the visual question's context should mention both the key entities and the correct answer, and that emphasizing critical entities should improve retrieval performance.
Given a question q, we use BM25 (Robertson and Zaragoza, 2009) as the sparse retriever to retrieve an initial set of passages P_init = {p_1, ..., p_K}. We calculate a baseline score SRR(P_init) for these K passages using summed reciprocal ranking (SRR) as shown in Eq. 1:

SRR(P) = Σ_{i=1}^{K} 1[p_i contains a correct answer] / i    (1)
We use summed reciprocal ranking instead of reciprocal ranking since it provides more stable scores for evaluating the set of retrieved passages and does not overweight the highest-ranked document. Then, for each entity e ∈ E, we retrieve another set of passages P_e using an entity-emphasizing query in which the entity is appended to the end of the question. Note that the BM25 retriever does not take word order into account, so simply appending entities does not cause undesired results from the resulting linguistic disfluency of the query.
The final score for an entity, S(e), is computed as the difference between the SRR of these two sets of retrieved passages, i.e., S(e) = SRR(P_e) − SRR(P_init). We regard entities with S(e) over a threshold θ as critical entities, i.e., E_oracle = {e ∈ E | S(e) > θ}.
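The oracle-entity criterion can be sketched in plain Python. This is a minimal sketch: `retrieve` stands in for the BM25 retriever, a simple substring test stands in for answer matching, and all names here are ours:

```python
def srr(passages, answer):
    """Summed reciprocal ranking (Eq. 1): sum of 1/rank over the
    ranked passages that contain the correct answer."""
    return sum(1.0 / (i + 1) for i, p in enumerate(passages) if answer in p)

def oracle_entities(question, entities, answer, retrieve, theta=0.8):
    """Keep entities whose emphasis improves the sparse retriever's SRR
    by more than theta; `retrieve` maps a query to a ranked passage list."""
    base = srr(retrieve(question), answer)
    return [e for e in entities
            if srr(retrieve(question + " " + e), answer) - base > theta]
```

Appending an entity to the query re-ranks the passages; entities that push answer-bearing passages up the ranking earn a positive score S(e).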
Qu et al. (2021) extract the top-k passages containing the correct answer from P_init to construct the positive passage set P+_init. Having identified oracle entities, a passage that contains both the right answer and an oracle entity is more likely to fit the context of the question. Therefore, we augment the positive passage set to include those passages for each oracle entity, i.e., P+ = P+_init ∪ {p+_e | e ∈ E_oracle}, where p+_e denotes the first passage that contains both the right answer and the oracle entity e. On average, there are 3.4 new positive passages per question. The negative passages are the same as those in Qu et al. (2021), and the number of training instances (positive-negative pairs) is unchanged.
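A minimal sketch of this positive-set augmentation; the function name and the substring-membership test are simplifications of the actual matching:

```python
def augment_positives(init_positives, ranked_passages, answer, oracle_ents):
    """For each oracle entity, add the first ranked passage containing
    both the entity and the correct answer to the positive set."""
    positives = list(init_positives)
    for e in oracle_ents:
        extra = next((p for p in ranked_passages
                      if answer in p and e in p), None)
        if extra is not None and extra not in positives:
            positives.append(extra)
    return positives
```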

Entity-Focused Retrieval
Entity-Focused Retrieval (EnFoRe) automatically recognizes critical entities and retrieves question-relevant knowledge specifically focused on them. In the following, "proj" denotes a projection function that consists of an MLP layer with layer-norm as normalization.

Encoders
Query encoder: As observed by Qu et al. (2021) and Luo et al. (2021), multi-modal transformers encode questions and visual content better than uni-modal transformers, so we adopt LXMERT (Tan and Bansal, 2019) for query encoding. In particular, given the query q, which contains a visual question Q and the set of detected objects V in the image, we project the "pooled_output" at the last layer of LXMERT to obtain the feature vector f_q ∈ R^d, i.e., f_q = proj(LXMERT(Q, V)) (Eq. 2). See the LXMERT paper for further details.

Passage encoder: Following Qu et al. (2021), we use BERT (Devlin et al., 2019) as the passage encoder and project the "[CLS]" representation to compute the feature vector f_p for each passage p.

Figure 2: EnFoRe model overview. We first extract a set of entities from the query consisting of a question and an image (Sec. 3). Then, the EnFoRe model computes the features for the query, the entities, and the passages (Sec. 4.1). Query features and passage features, together with entity features, are used to compute a query-entity score and a passage-entity score indicating the importance of the entities given the query and the passages, respectively (Sec. 4.2). These two importance scores are combined to produce an entity-matching score, and the features of the query and the passages are used to predict a query-passage matching score.
Entity encoder: In order to provide query context for each entity, we append the question and a generated image caption (Wang et al., 2022) after each entity. The input to the entity encoder is "[CLS] entity [SEP] question [SEP] caption". Similar to the passage encoder, we use BERT (Devlin et al., 2019) as the entity encoder and project the "[CLS]" representation to compute the features for each entity.

Retrieval Scores
EnFoRe aims to retrieve question-relevant knowledge that focuses on critical entities. Therefore, the similarity metric consists of two parts: a question-relevancy term and an entity-focus term.
Modeling question relevancy: We model the question-relevancy term S_qp as the inner product of the query and passage features, i.e., S_qp(q, p) = f_q^T f_p. During inference, as the query and passage features are decomposable, maximum inner product search (MIPS) can be applied to efficiently retrieve the top passages for the query.

Modeling entity focus: The entity-focus term consists of two parts: query features are used to identify critical entities from the entity set in Sec. 3, and passage features are used to determine whether a passage contains these key entities. For each entity, we compute the query-entity score as the inner product of the projected query and entity features, i.e., S_qe(q, e) = proj(f_q)^T proj(f_e), and the passage-entity score as S_pe(p, e) = proj(f_p)^T proj(f_e). Then, we combine all of the entities and compute the entity-focused score S_qpe per Eq. 5:

S_qpe(q, p, E) = [ Σ_{e∈E} σ(S_qe(q, e)) · S_pe(p, e) ] / [ Σ_{e∈E} σ(S_qe(q, e)) ]    (5)

where σ denotes the sigmoid function. Another way to interpret Eq. 5 is to treat it as modeling the conditional distribution Pr(p | q) with the entities as hidden variables.
The final score S(q, p) for the query q and passage p linearly combines both terms, i.e., S(q, p) = S_qp(q, p) + λS_qpe(q, p, E), where the weight λ controls the balance between these two terms.
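As a concrete illustration, Eq. 5 and the final score can be sketched in plain Python. For brevity the learned `proj` MLPs are replaced by the identity, so the raw inner products below stand in for the projected features; this is a sketch, not the released implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def entity_focused_score(f_q, f_p, entity_feats):
    """Eq. 5: average of passage-entity scores S_pe, weighted by the
    sigmoid of the query-entity scores S_qe."""
    weights = [sigmoid(dot(f_q, f_e)) for f_e in entity_feats]
    s_pe = [dot(f_p, f_e) for f_e in entity_feats]
    return sum(w * s for w, s in zip(weights, s_pe)) / sum(weights)

def final_score(f_q, f_p, entity_feats, lam=1.0):
    """S(q, p) = S_qp + lambda * S_qpe."""
    return dot(f_q, f_p) + lam * entity_focused_score(f_q, f_p, entity_feats)
```

Because S_qp alone is decomposable, MIPS can fetch a shortlist with `dot(f_q, f_p)` first, and the entity term can rerank that shortlist afterwards.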

Training
We train our EnFoRe model with a set of training instances, each consisting of a query containing the visual question with an image, a positive passage, a retrieved negative passage, and the set of entities. We present more details on constructing the training data in Sec. 6.1. We adopt the "R-Neg+IB-All" setting introduced by Qu et al. (2021) that regards the retrieved negatives, along with all other in-batch passages, as negative samples. Following previous work (Karpukhin et al., 2020), we use cross-entropy loss to maximize the relevancy score S_qp(q, p) and the entity-focus score S_qpe(q, p, E) of the positive passage against the negatives identified above. In addition, we regard the oracle entities, defined in Sec. 3.3, as positive entities and the others as negative entities, and we use binary cross-entropy loss to supervise the importance score S_qe(q, e). We use AdamW (Loshchilov and Hutter, 2018) with a learning rate of 1e-5 to train the EnFoRe model for 8 epochs, where 10% of the iterations are used to linearly warm up the model. The batch size is set to 6 per GPU, and we use 4 GPUs (Tesla V100) for each experiment.
The training process takes about 45 hours per model. We save the parameters every 5,000 steps and report the best results (MRR@5) on the validation set. The hidden-state size is set to 768, following Qu et al. (2021) for fair comparison. The threshold θ for recognizing critical entities is set to 0.8. Our model consists of a BERT encoder and an LXMERT encoder, resulting in 430M parameters in total.

Inference: As the question-relevancy term is decomposable, we again adopt MIPS to retrieve the top-80 passages. Then, we evaluate the entity-focus term for each passage and use the combined score S(q, p) to rerank the retrieved passages.
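The training objectives can be sketched as follows. This is a simplified, single-example version under our own naming; the actual model computes these losses in batch with in-batch negatives:

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """Cross-entropy of the positive passage against retrieved and
    in-batch negatives; applied to both S_qp and S_qpe."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

def entity_bce_loss(s_qe, is_oracle):
    """Binary cross-entropy supervising the query-entity score S_qe,
    with oracle entities as positives."""
    p = 1.0 / (1.0 + math.exp(-s_qe))
    return -math.log(p) if is_oracle else -math.log(1.0 - p)
```

The contrastive term drives the positive passage's score above every negative's; the BCE term pushes σ(S_qe) toward 1 for oracle entities and 0 otherwise.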

Reader
We employ the current state-of-the-art KAT model (Gui et al., 2021) as our VQA reader. KAT is a generation-based reader that learns to generate the answer given the retrieved knowledge. It adopts an FiD (Izacard and Grave, 2021) architecture to incorporate both implicit knowledge, generated by a frozen GPT-3 model, and explicit knowledge. For implicit GPT-3 knowledge, the input format is "question: ques? candidate: cand. evidence: expl.", where ques, cand, and expl denote the question, answer, and its explanation generated by the GPT-3 model (Brown et al., 2020). For the explicit knowledge, the input format is "question: ques? entity: ent. description: desc.", where ent and desc denote the retrieved entity and its description. See Gui et al. (2021) for further details.
We change the original explicit knowledge to the knowledge retrieved by our EnFoRe model. As each retrieved passage contains multiple sentences, and usually not all are relevant, we select the most relevant sentence from each passage. Specifically, following Wu et al. (2022), we convert the question and the candidate answers into a set of statements. Then, we decontextualize each sentence of each passage and compute the BERTScore (Zhang et al., 2020) between the decontextualized sentences and each statement. The sentence with the highest BERTScore across these statements is extracted for each passage. The input format is "question: ques? entity: ents. description: desc.", where ents denotes the top-10 entities judged by the query-entity importance score S_qe(q, e) and desc the extracted sentence.
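The sentence-selection step can be sketched with a pluggable scorer standing in for BERTScore; the helper name and the toy word-overlap scorer in the example below are ours:

```python
def best_sentence(passage_sentences, statements, score):
    """Return the passage sentence with the highest similarity to any
    answer statement; `score` is a pluggable (sentence, statement) -> float
    function (BERTScore in the paper)."""
    return max(passage_sentences,
               key=lambda s: max(score(s, t) for t in statements))
```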
Following Gui et al. (2021), we perform experiments in two KAT settings: (1) "KAT-base + EnFoRe", a single model that employs T5-base (Raffel et al., 2020) as the backbone encoder and decoder; and (2) "KAT-full + EnFoRe", an ensemble model in which each model employs T5-large as the backbone encoder and decoder. As our knowledge is question-aware, we encode only the top 10 retrieved sentences, in contrast to the 40 sentences in the original KAT. We adopt the same training scheme as KAT.

Experimental Setup

Following Qu et al. (2021), we take the Wikipedia passage collection with 11 million passages created by previous work as our knowledge source, where each passage contains at most 384 word pieces with intact sentence boundaries. We extract the 25 passages with the highest BM25 scores (the CombSum setting in Qu et al. (2021)) that do not contain the correct answers as retrieved negative samples, and the top 5 passages that contain the correct answer as retrieved positive samples for training. In addition, we also consider the most relevant passage that contains each of the oracle entities together with the correct answer as a positive passage. The positive and negative passages are randomly paired to form the training instances. During evaluation, any passage containing at least one of the correct answers is considered a gold passage. For VQA models, we adopt the same model architecture and training scheme and only switch the external knowledge for the KAT models. Due to limits on computational resources, we use 10 retrieved sentences for the KAT model. The models are evaluated every 500 steps. We normalize the predictions by lowercasing, lemmatizing, and removing articles, punctuation, and duplicated whitespace. We follow the standard evaluation metric recommended by the VQA challenge. The results for "KAT-base + EnFoRe" are obtained by averaging three runs with different random seeds.

Passage Retrieval Results
We present our passage retrieval results in Table 1, comparing them with current state-of-the-art systems. We adopt MRR and precision at a cut-off of 5 as our automatic evaluation metrics. The first four rows present sparse retrieval results. The BM25 approach using our oracle entities achieves an MRR@5 of 0.6401 and a precision@5 of 0.4345 on the OK-VQA RetTest set, indicating the comprehensiveness and potential helpfulness of the extracted entities. With the help of these entities, EnFoRe-LXMERT outperforms the previous SOTA, DPR-LXMERT (with the same architecture for visual and textual embedding), by 2.74% MRR@5 and 1.15% precision@5. We perform Student's paired t-tests with a p-value of 5% to test the significance of our results. In particular, we found that the MRR and precision gaps between our EnFoRe (Full) model and both (1) DPR-LXMERT and (2) EnFoRe (Backbone) are statistically significant.

Ablation study on entity sources: We also performed an ablation study on entity-based re-ranking, shown in Table 2. The EnFoRe backbone without re-ranking achieves an MRR of 0.4632, outperforming DPR (Qu et al., 2021) by 1.06%. This indicates that using our entities during training helps the retriever build better representations, because (1) we add supervision that tells the retriever which entities are more likely to lead to the correct answers, and (2) we add training passages that contain both the oracle entities and the right answers. Image-based and question-based entities help our EnFoRe model achieve MRRs of 0.4688 and 0.4750, respectively.

Method | Knowledge Resources | VQA Score
Q-only (Marino et al., 2019) | - | 14.9
BAN (Kim et al., 2018) | - | 25.2
MUTAN (Ben-Younes et al., 2017) | - | 26.4
Mucko (Zhu et al., 2020) | Dense Caption | 29.2
ConceptBert (Gardères et al., 2020) | ConceptNet | 33.7
KRISP (Marino et al., 2021) | Wikipedia + ConceptNet | 38.9
MAVEx (Wu et al., 2022) | Wikipedia + ConceptNet + Google Image | 39.4
RVL (Shevchenko et al., 2021) | Wikipedia + ConceptNet | 39.0
VRR (Luo et al., 2021) | Google Search | 39.2
PICa (Yang et al., 2022) | Frozen |

We also present an ablation study on individual entity sources in Table 4. We introduce a particularly challenging "RetTest Hard" split that collects all of the examples in RetTest where none of the correct answers is in the entity set. Our EnFoRe model consistently achieves better retrieval performance (i.e., MRR@5 and P@5) by incorporating entities extracted from each source. On the normal RetTest set, removing entities from candidate answers yields the largest decrease in MRR@5. This is because the candidate answers cover many of the correct answers in the OK-VQA test split and therefore provide direct hints to the desired content. On the RetTest Hard set, image-based entities generally improve the retrieval performance more, indicating the need for explicitly discovering critical visual clues.

Visual Question Answering Results
We present the VQA performance of incorporating our EnFoRe knowledge into the state-of-the-art KAT model in Table 3. While a plain KAT-base model, which uses GPT-3 and CLIP (Radford et al., 2021) to retrieve image-based knowledge, achieves a score of (50.58), switching to our EnFoRe knowledge brings a 1.7-point improvement, achieving a score of 51.34 (52.24). Our ensemble model (KAT-full + EnFoRe) achieves a new SOTA score of 54.35 (55.23).

Qualitative results: We present sample results in Figure 3, where (a)-(d) show cases in which our EnFoRe model correctly identifies the critical entities (i.e., the orange, the kite, the calico cat, and the teddy bear) and retrieves question-relevant knowledge focused on them. Case (e) shows an example where the retrieved sentence misleads the reader: the reader currently receives only textual input, so it fails to verify whether the pizza actually has a thin crust. Case (f) shows an example where the retriever properly focuses on the critical entity "NORWOOD" but fails to understand that this is the destination of the bus.

Human evaluation: We also conducted a human evaluation on AMT of the retrieved entities and sentences to demonstrate that the knowledge retrieved by EnFoRe better supports the correct answers. We first randomly sampled 1,000 test questions that are correctly answered by both the original KAT-base model and our "KAT-base + EnFoRe" model. Next, we extracted the top-3 sentences with the highest attention score, averaged over all attention heads from the last decoder layer, for both models. We also extracted the top-3 visual entities: for EnFoRe, the three entities with the highest attention scores in the input prompts are selected, while for the original KAT model, we use the three entities from the three top retrieved sentences. Next, we show AMT workers the question, the predicted answer, the image with bounding boxes for the top entities, and the three retrieved sentences for both systems, randomly ordered. We present an example in the Appendix. Finally, workers are asked to judge which system's set of highlighted entities and sentences best supports the given answer. Experimental results show that judges pick our EnFoRe knowledge 61.8% of the time, indicating a clear preference over the original KAT knowledge. Such information can be considered an explanation or rationale for the system's answer, and improved explanations can engender greater trust and acceptance from users and provide additional transparency into the system's operation.

Conclusion
In this work, we presented an Entity-Focused Retrieval (EnFoRe) model for retrieving knowledge for outside-knowledge visual questions. The goal is to retrieve question-relevant knowledge focused on critical entities. We first construct an entity set by parsing the question and the image. Then, EnFoRe predicts a query-entity score, indicating how likely an entity is to lead to a correct answer, and a passage-entity score, indicating how well the entity fits the context of the passage. These two scores are combined to re-rank the conventional query-passage relevancy score. EnFoRe demonstrates the clear advantages of improved multi-modal knowledge retrieval and helps improve VQA performance with its retrieved knowledge.

Limitations
Our EnFoRe model is empowered by a comprehensive set of parsed entities from the question and the image. However, as shown in the failure cases in the experiment section, those entities may contain detection errors that lead to undesired results. In addition, during training, we adopt a fully automatic scheme for annotating critical entities, assuming they can help a sparse retriever achieve better SRR results; explicit human annotation could potentially improve the quality of the critical entities identified. While our current approach collects both question-based and image-based entities, these are not fully adequate: ideally, it could be beneficial to include not only the relevant objects for the visual question but also other kinds of descriptors that may act as useful clues for knowledge retrieval. Another limitation of the current approach is that we encode each entity separately, ignoring the relationships between entities, which could be helpful for knowledge retrieval.

In Figure 4, we present the MRR (red line) and the oracle-entity recall (blue line) at a cut-off of 5, the latter defined as the fraction of oracle entities appearing in the top-5 retrieved passages over the total number of oracle entities. Our EnFoRe model not only improves the MRR results but also retrieves more oracle entities in the top passages, making the retrieved content more relevant. Also, the EnFoRe model is robust to the re-ranking weight, yielding consistent improvements over a broad range of weights.
Figure 1: Top: Examples of critical entities upon which retrieval models should focus; Bottom: Example of improved passage retrieval using critical entities.
Figure example — Question: "... plush toy was named after what US president?" Retrieved knowledge: "A teddy bear is a stuffed toy in the form of a bear …, and named after President Theodore Roosevelt …"

Figure 3: Qualitative results for EnFoRe; (a)-(d) present cases where EnFoRe correctly identifies the critical entities and the retrieved question-relevant knowledge properly focuses on them; (e) and (f) present two failure cases.

Table 1: MRR and precision retrieval results on OK-VQA. The first four rows present sparse retrieval results and the others are dense retrieval results.

Table 2: Ablation study on the entity sources used during re-ranking.
For retrieval, we adopt the same data configuration as Qu et al. (2021), which evenly splits the test set of the OK-VQA dataset into a validation set and a test set, referred to as RetVal and RetTest, respectively.

Table 3: EnFoRe knowledge boosts the current state-of-the-art approaches on OK-VQA. The middle column lists the external knowledge sources, if any, used in each system. The additional result shown in parentheses is computed by an unofficial evaluation metric that takes the minimum of 1.0 and the number of agreeing annotators divided by 3.

Table 4: Ablation study on entity sources.