In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering

Visual Question Answering (VQA) methods aim at leveraging visual input to answer questions that may require complex reasoning over entities. Current models are trained on labelled data that may be insufficient to learn complex knowledge representations. In this paper, we propose a new method to enhance the reasoning capabilities of a multi-modal pretrained model (Vision+Language BERT) by integrating facts extracted from an external knowledge base. Evaluation on the KVQA dataset benchmark demonstrates that our method outperforms competitive baselines by 19%, achieving new state-of-the-art results. We also perform an extensive analysis highlighting the limitations of our best performing model through an ablation study.


Introduction
Visual Question Answering (VQA) is a popular multi-modal task of answering a question about an image. It tracks both inter-modal interactions and reasoning capabilities of models (Wang et al., 2017; Marino et al., 2019). Recent studies have tested compositional reasoning (Hudson and Manning, 2019) and the integration of external knowledge (Wang et al., 2016, 2017; Shah et al., 2019; Marino et al., 2019) for VQA. In this paper, we address Knowledge-aware VQA (KVQA) (Shah et al., 2019), defined as a VQA task where it is not reasonable to expect a model without access to a knowledge base to be able to answer the questions in the test set.
In a uni-modal textual context, both synthetic dataset (Kassner et al., 2020) and task-driven (Ding et al., 2020) studies of neural models have shown significant competence at symbolic reasoning. This is encouraging, as neural pretrained Language Models such as BERT (Devlin et al., 2019) achieve state-of-the-art results in a wide range of natural language understanding tasks and benchmarks such as Natural Language Inference (Bowman et al., 2015). Rajani et al. (2019) use pretraining on a domain-specific dataset to improve CommonsenseQA by 10% absolute accuracy. Tamborrino et al. (2020) develop an improved training objective to improve COPA by 10% absolute accuracy. Bouraoui et al. (2020) find that BERT is capable of relational induction, whilst Broscheit (2019) and Petroni et al. (2020) find that BERT stores nontrivial world-knowledge.
Previous work has argued that restriction to a uni-modal context may itself impair reasoning performance (Barsalou, 2008;Li et al., 2020). In a bimodal Vision + Language (V+L) context, datasets such as CLEVR and GQA allow for the evaluation of both model reasoning and language grounding. Within this setting, Ding et al. (2020) and Lu et al. (2020) show that appropriate neural models trained on large quantities of data can exhibit accurate reasoning.
In this paper, we propose a new method of applying a massively pretrained V+L BERT model (Chen et al., 2020) to the KVQA task (Shah et al., 2019). Our method is able to learn a set of reasoning types (confirming findings in Ding et al. (2020)) but can increase performance even more by incorporating external factual information. KVQA answers require attending to a knowledge base, allowing us to quantify the contribution of both explicit and implicit knowledge extracted from supervised training data. We also quantify the degree to which corpus bias makes certain question types harder, and outline how future datasets may be better balanced.
Our contributions are as follows:
• We integrate facts into a V+L BERT-based model architecture for VQA, leading to a 19.1% accuracy improvement over previous baselines on KVQA.
• We evaluate our model's reasoning capabilities through an ablation study, proposing explanations for poor performance on certain question types as well as highlighting our model's strong preference for text and facts over the image modality.
• We conduct a bias study of the KVQA dataset, revealing both strengths and potential improvements for future VQA datasets.

Related Work
VQA tasks explicitly encourage grounded reasoning (Antol et al., 2015), with emphasis on a variety of sub-domains, such as commonsense (Zellers et al., 2019), compositionality and grounding (Suhr et al., 2020), factual reasoning (Wang et al., 2017) or external knowledge reasoning (Wang et al., 2016; Marino et al., 2019; Shah et al., 2019). State-of-the-art systems for external knowledge VQA are based on Memory Networks (MemNet; Weston et al., 2014). In Shah et al. (2019), the facts are extracted from the Knowledge Graph (KG) by considering the visual (from image) and optionally textual (from Wikipedia caption) entities. They are then embedded using a Bi-LSTM encoder and fed into the memory. After the question is embedded in a similar way, the resulting representation is used to query the memory by soft attention. Several stacked memory layers are used to better model multi-hop facts. Wang et al. (2016, 2017) introduce two datasets, KB-VQA and FVQA respectively, and address the task with systems that perform searches in a visual knowledge graph formed from the image and a KB. The question is first mapped to a query of the form 〈visual object, relationship, answer source〉, which is then used to extract the supporting facts from the KB. They report improved results when compared to systems using LSTM, SVM and hierarchical co-attention (Lu et al., 2016).
In Marino et al. (2019), the OK-VQA dataset is presented with baseline results obtained with MUTAN (Ben-younes et al., 2017), a multimodal tensor-based Tucker decomposition which models interactions between visual (from CNN) and textual (from RNN) representations. These systems perform considerably worse on OK-VQA than on standard VQA, demonstrating that the corpus requires external knowledge to be solved correctly.
Recent work has introduced methods to incorporate visual information to create Vision+Language BERT models through joint multimodal embeddings (Chen et al., 2020; Su et al., 2019). First, image and text are embedded into the same space, and then Transformer networks are applied as in the standard BERT model (Devlin et al., 2019).
Our work is most similar to that of Shah et al. (2019) since the same preprocessing pipeline is used. However, our system does not use a memory network, and instead relies on a BERT-based model (UNITER, see section 3) to model the relationship between question, facts, and image with self-attention layers.

Methodology
To answer KVQA questions with neural models, we start from UNITER (Chen et al., 2020), the V+L BERT model with the highest score on the commonsense VQA task VCR (Zellers et al., 2019).
In order to allow UNITER to accept external KG facts, we cast these facts to the textual form 'Entity_1 Relation Entity_2'. To keep the number of input facts small, we perform a conditional search of the KG. The KVQA task consists in finding $a^*$:
$$a^* = \arg\max_{a \in A} p(a \mid q, i, K) \approx \arg\max_{a \in A} p(a \mid q, i, k_{i,q})$$
where $a^*$ is the correct answer out of the candidate set $A$; and $q$, $i$, and $K$ are a question, image and knowledge base, respectively. As shown, we may reduce the KG through a conditional search to find the relevant subset of facts $k_{i,q}$.
To define the subset $k_{i,q}$, we follow Shah et al. (2019) in extracting all facts from the knowledge base that are up to two hops from any entities detected by the textual entity linking or the face detection. Our model, as presented in section 2, consists of two stages: preprocessing, which implements relevant fact extraction, and reasoning, which selects an answer from the question, facts, and image features.
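The two-hop fact search described above can be sketched as a breadth-first expansion over (subject, relation, object) triples. This is a minimal illustration and not the pipeline of Shah et al. (2019): the function names, the in-memory triple list, and the choice to follow edges only from subject to object are all assumptions made for the example.

```python
from collections import defaultdict


def extract_facts(triples, seed_entities, max_hops=2):
    """Collect triples reachable within max_hops of the seed entities.

    Assumption: edges are followed from subject to object only.
    """
    # Index triples by subject so that each hop is a dictionary lookup.
    by_subject = defaultdict(list)
    for subj, rel, obj in triples:
        by_subject[subj].append((subj, rel, obj))

    facts = []
    seen = set(seed_entities)
    frontier = list(dict.fromkeys(seed_entities))  # de-duplicate, keep order
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for triple in by_subject.get(entity, []):
                facts.append(triple)
                obj = triple[2]
                if obj not in seen:
                    seen.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return facts


def to_text(fact):
    """Cast a triple to the textual form 'subject relation object'."""
    return " ".join(fact)
```

With seeds taken from the detected entities, `[to_text(f) for f in extract_facts(kg, seeds)]` would give the textual facts passed to the reasoning stage.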

Preprocessing Stage
For preprocessing and fact acquisition, we broadly reproduce the fact and feature extraction process used in Shah et al. (2019). We perform object detection with the Faster R-CNN network (Ren et al., 2017). A seven-dimensional normalised size and location vector is concatenated with the Faster R-CNN features.
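As a sketch of the location encoding, the snippet below builds a seven-dimensional vector and appends it to a region feature. The layout used here (normalised x1, y1, x2, y2, width, height, area) follows the description commonly given for UNITER and is an assumption, as are the function names and the 2048-dimensional feature size used in the usage note.

```python
def location_vector(box, image_width, image_height):
    """Return [x1, y1, x2, y2, w, h, area], each normalised to the range 0-1.

    `box` is (x1, y1, x2, y2) in pixels. The ordering is an assumption.
    """
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / image_width, y1 / image_height
    nx2, ny2 = x2 / image_width, y2 / image_height
    w, h = nx2 - nx1, ny2 - ny1
    return [nx1, ny1, nx2, ny2, w, h, w * h]


def region_input(visual_features, box, image_width, image_height):
    """Concatenate the region feature with its location vector."""
    return list(visual_features) + location_vector(box, image_width, image_height)
```

For example, a hypothetical 2048-dimensional Faster R-CNN feature would yield a 2055-dimensional region input.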
For person detection, we use MTCNN and Facenet (Schroff et al., 2015) models, pretrained on the MS-celeb-1M (Guo et al., 2016) dataset, to generate 128-dimensional embeddings. We predict names by nearest-neighbour comparison with the KVQA reference dataset. We treat the name identification as a multi-class classification problem, achieving a Micro-F1 of 0.539. Since this is lower than reported in Shah et al. (2019), we follow them in applying a textual entity linker (van Hulst et al., 2020) over supplied image descriptions. This setup achieves a per-image Micro-F1 of 0.686.
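Name prediction by nearest-neighbour comparison can be illustrated as follows. This is a sketch under assumptions: Euclidean distance is used as the metric, the reference set is a plain list of (name, embedding) pairs, and `nearest_name` is a hypothetical helper rather than part of any face recognition library.

```python
import math


def nearest_name(query_embedding, reference):
    """Return the name whose reference embedding is closest to the query.

    `reference` is a list of (name, embedding) pairs. Euclidean distance
    is an assumption; the source does not state the metric.
    """
    best_name, best_distance = None, math.inf
    for name, embedding in reference:
        distance = math.dist(query_embedding, embedding)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name
```

In the setting described above, the embeddings would be 128-dimensional face vectors and the reference list would come from the KVQA reference images.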
Normalised image location facts are generated from these detections, such as 'Barack Obama at 42 78', which indicates that the centre of the bounding box for Barack Obama is at normalised (0-100) position x=42, y=78 of the image. We use the names of identified entities to query the reduced Wikidata graph (Vrandečić and Krötzsch, 2014) of Shah et al. (2019) up to two hops. The extracted facts are finally cast to the form 'subject relation object'.
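A location fact of this form can be produced from a pixel bounding box as sketched below. The function name and the rounding to the nearest integer are assumptions; only the output format 'name at x y' with a 0-100 scale comes from the text.

```python
def location_fact(name, box, image_width, image_height):
    """Format the centre of a bounding box as a '<name> at <x> <y>' fact.

    `box` is (x1, y1, x2, y2) in pixels; the centre is scaled to 0-100.
    Rounding to the nearest integer is an assumption.
    """
    x1, y1, x2, y2 = box
    centre_x = round(100 * (x1 + x2) / 2 / image_width)
    centre_y = round(100 * (y1 + y2) / 2 / image_height)
    return f"{name} at {centre_x} {centre_y}"


print(location_fact("Barack Obama", (320, 580, 520, 980), 1000, 1000))
# prints "Barack Obama at 42 78"
```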

Reasoning Stage
The neural model we use, UNITER, is pretrained on MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2016), Conceptual Captions (Sharma et al., 2018), and SBU Captions (Ordonez et al., 2011). It is a multi-task system that is trained to perform Masked Language Modeling, Image-Text Matching, and Masked Region Modeling (Chen et al., 2020).

Experimental Setup
We select the KVQA dataset for two reasons: to our knowledge, it is the largest external knowledge dataset (with 183k questions), and the questions are annotated with their reasoning types. We use accuracy as the evaluation metric and provide results over both the entire dataset and also for each question type as provided in the KVQA dataset.
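The evaluation described here, overall accuracy plus accuracy per question type, can be computed as in the following sketch. It assumes one question type per record, which simplifies the case where a question carries several type annotations; the record layout and function name are hypothetical.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return (overall_accuracy, {question_type: accuracy}).

    `records` is an iterable of (question_type, predicted, gold) tuples.
    Assumption: each record carries exactly one question type.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, predicted, gold in records:
        total[question_type] += 1
        correct[question_type] += int(predicted == gold)
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_type
```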
The baseline systems for KVQA are those presented in Shah et al. (2019) and discussed in section 2. The first baseline is a stacked BLSTM encoder, operating over question and facts. This system has an overall accuracy of 48.0%. The second is the MemNet architecture, which has the highest previously reported accuracy at 50.2%.
We use the UNITER BASE pretrained model available at the ChenRocks GitHub repository with custom classification layers (MLP + softmax output layer). For task training, we merge retrieved facts with the question, separating each statement with the '[SEP]' token, following research indicating that this token induces partitioning and pipelining of information across attention layers (Clark et al., 2019). The textual input stream is tokenised with the HuggingFace 'bert-base-uncased' tokeniser (Wolf et al., 2020). We set the maximum WordPiece sequence length to 412, the maximum visual object count to 100, the learning rate to [...] (2019)) with an absolute improvement of 19%.
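The merging of question and facts into a single '[SEP]'-delimited text stream can be sketched as below. Several details are assumptions made for illustration: the question is placed first, whole facts are dropped once the length budget is exhausted, and whitespace splitting stands in for WordPiece tokenisation so that the example runs without any external library.

```python
def build_text_input(question, facts, max_tokens=412, tokenize=str.split):
    """Join the question and facts with '[SEP]' within a token budget.

    `tokenize` defaults to whitespace splitting as a stand-in for the
    WordPiece tokeniser. Ordering and truncation policy are assumptions.
    """
    parts = [question]
    used = len(tokenize(question))
    for fact in facts:
        cost = len(tokenize(fact)) + 1  # +1 for the '[SEP]' separator
        if used + cost > max_tokens:
            break
        parts.append(fact)
        used += cost
    return " [SEP] ".join(parts)
```

Passing a real tokeniser's tokenisation function as `tokenize` would make the budget reflect WordPiece counts rather than whitespace-separated words.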
Our results show that UNITER learns to reason more accurately than MemNet in all but two cases. The increase is greatest in the question types involving multiple entities ('Multi-Entity', 'Multi-Hop', 'Multi-Relation'), suggesting that UNITER is able to robustly learn these reasoning types. We speculate that stacked self-attention layers in BERT are able to better attend to the many involved entities than MemNet.
We now discuss the performance of our model on its weakest categories, namely 'Subtraction' and 'Spatial'. The poor performance on 'Subtraction' questions confirms previous results that BERT-like models require specialised pretraining for numerical reasoning tasks (Geva et al., 2020). In the case of our model specifically, we note the lack of numerical reasoning tasks in UNITER's pretraining regime. 'Spatial' is the model's least accurate question type (21.4%) and the biggest absolute de- Both of these have been shown to be problematic for BERT (Kassner et al., 2020;Geva et al., 2020).

Analysis
UNITER performs well at the reasoning tasks in general, with the most surprising result being that it apparently does better at multi-hop reasoning than one-hop. We believe that this can be explained by an unbalanced distribution of answer types in the dataset perturbing the results (see Table 1). We discuss this in Section 6.1. In order to better understand the reasoning capability of our model and the impact of each input modality, we perform an inference time ablation study, presented in Table 2.
Ablation of Image features (column 'Q+F') does not change the performance, suggesting that the model is not attending to image features. To confirm this hypothesis, we performed an experiment with adversarial images, obtaining very similar results for each question type and the same overall score (69.30%). We explain this behaviour by the fact that the preprocessing pipeline extracts all the required information as explicit facts which the model prefers over the more ambiguous visual features. We leave a deeper analysis for further work.
An interesting case is the 'Spatial' questions, where facts alone are able to correctly answer 13% of the questions. This is likely the result of the answers to this question type being entities present in the facts. Again, we observe that the model is not able to learn this information from the visual features.
Table 3: Further Ablation and Adversarial Studies. *Adversarial Modality indicates that the sample from that modality was randomly assigned from the entire data split.

Bias Studies
We briefly discuss corpus bias, a well-known concern in VQA (Goyal et al., 2019). We consider question difficulty across three parameters: reasoning difficulty, task design, and corpus bias. Certain question types are inherently more complex, as discussed in Section 5. Additionally, the number of answer classes differs between question types, effectively weakening any priors models might form (see Entropy column in Table 1).
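One way to quantify how strong an answer prior a question type allows is the Shannon entropy of its answer distribution, sketched below. Whether the Entropy column in Table 1 uses this exact definition or log base 2 is an assumption; the function name is hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution.

    Low entropy means a few answers dominate, so a model can score well
    from priors alone. Log base 2 is an assumption.
    """
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A question type with two equally likely answers has an entropy of one bit, while a type whose answer is always the same has an entropy of zero.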
Finally, an unbalanced dataset may cause certain reasoning types to be underrepresented, making them harder for models to learn. 'Spatial' and 'Subtraction' questions are among the least represented in the training dataset, which increases their difficulty for the model. Unseen answer classes are also an issue. For 'Spatial' questions, only 54.2% of the test answers (output classes) are actually seen during training, placing an upper bound on accuracy. We find that 98.4% of the 'Spatial' questions the model answered correctly and 95.7% of the 'Spatial' questions the model answered incorrectly were supplied with adequate facts by the preprocessing pipeline.
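The bound imposed by unseen answer classes can be estimated as in the following sketch. The text does not state whether the 54.2% figure counts test instances or distinct answer classes, so the helper (a hypothetical name) supports both through the `unique` flag.

```python
def seen_answer_rate(train_answers, test_answers, unique=False):
    """Fraction of test answers that also occur in the training answers.

    With unique=False every test instance is counted, which bounds the
    accuracy of a classifier restricted to training classes. With
    unique=True distinct answer classes are counted instead.
    """
    seen = set(train_answers)
    candidates = set(test_answers) if unique else list(test_answers)
    hits = sum(1 for answer in candidates if answer in seen)
    return hits / len(candidates)
```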
Training time ablation and adversarial experiments
To further probe the task, we perform a training time ablation with first facts, and then facts and images removed (see Table 3). In this we seek to exhibit the capability of our model to leverage the available modalities and to compensate for the missing ones.
Through comparing the training time and inference time ablations, we can better understand the importance of a modality to solving the task.
Comparing the training time and inference time ablations of facts ('Q+I' column of Table 3 and of Table 2), we observe that when facts are unavailable at train time, the model attends to images to obtain 47.0% accuracy, which is 15.4 percentage points more than the 31.6% obtained by the corresponding inference time ablation. This indicates that the visual modality can provide useful information for this task.
We observe a similar trend in the fact and image ablation setting ('Q' column of Table 3 and of Table 2): the model is better able to leverage questions to make accurate predictions when the additional modalities are never available.
We also perform adversarial checks, where random images or facts from the data split are presented at inference time. These align closely with the ablation study, with adversarial images (Column 'I' of Table 3) performing within 0.1% of blanked images (Column 'Q+F' of Table 3) and adversarial facts (Column 'F' of Table 3) performing within 1% of blanked facts (Column 'Q+I' of Table 3). These results confirm the importance of factual data and the unimportance of raw image features to a model trained on the full data.
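The difference between blanking a modality and replacing it adversarially can be sketched as two transformations of an evaluation split. The dictionary layout, the function names, and sampling with replacement are assumptions; the source only states that the adversarial sample is randomly assigned from the entire data split.

```python
import random


def ablate(samples, modality, blank):
    """Replace one modality with a blank value in every sample."""
    return [{**sample, modality: blank} for sample in samples]


def adversarial(samples, modality, seed=0):
    """Replace one modality with a value drawn at random from the split.

    Sampling with replacement is an assumption; a sample may by chance
    receive its own original value.
    """
    rng = random.Random(seed)
    pool = [sample[modality] for sample in samples]
    return [{**sample, modality: rng.choice(pool)} for sample in samples]
```

Both functions return new lists and leave the original samples untouched, so the same split can be evaluated under several conditions.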

Conclusion and Future Work
We evaluated our model and found that it improves on the previous state of the art by a substantial margin (19.1%). An ablation study revealed the specific strengths and weaknesses of our model on certain question categories when evaluated on the KVQA dataset. We show that the UNITER model is not actually using the visual input.
In the future, we seek to create a large external knowledge dataset designed following KVQA with more entities besides persons to encourage grounded reasoning, and better calibration of answer types. We will also consider pretraining our model on closely related tasks. This will help to form a model capable of learning robust reasoning with a high degree of spatial specificity and entity discrimination.

Ethical Statement
This work is based on the open-source KVQA dataset, an English multimodal dataset, and the Wikidata knowledge base (also in English). No English-specific preprocessing was used for this research and the UNITER model is language agnostic, which tends to suggest that this could generalize to other languages. We will make our code publicly available to ensure the reproducibility of our experiments in the following repository