Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate: retrievals are frequently too general and fail to cover the specific knowledge needed to answer the question, and the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevance. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, currently the largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models and achieve new state-of-the-art performance on OK-VQA.
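To make the entity-focused idea concrete, here is a minimal, hypothetical scoring sketch in PyTorch: each passage is scored both against the whole visual question and against each candidate entity, and the two signals are combined. All names, tensor shapes, and the mixing weight `alpha` are illustrative assumptions, not the actual EnFoRe implementation.

```python
# Minimal sketch of entity-focused passage scoring. All names, shapes, and the
# mixing weight are illustrative assumptions, not the actual EnFoRe model.
import torch

def score_passages(q_emb, entity_embs, passage_embs, alpha=0.5):
    """Combine a question-level score with the best entity-level score.

    q_emb:        (d,)    embedding of the visual question (image + text)
    entity_embs:  (e, d)  embeddings of candidate question-relevant entities
    passage_embs: (p, d)  embeddings of candidate knowledge passages
    Returns a (p,) relevance score per passage.
    """
    q_scores = passage_embs @ q_emb            # (p,) question-level relevance
    ent_scores = passage_embs @ entity_embs.T  # (p, e) entity-level relevance
    best_ent = ent_scores.max(dim=1).values    # (p,) most helpful entity per passage
    return alpha * q_scores + (1 - alpha) * best_ent

# Toy usage with random embeddings.
d = 8
scores = score_passages(torch.randn(d), torch.randn(3, d), torch.randn(5, d))
best_passage = scores.argmax().item()
```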
Real-world daily scenes are complex due to occlusion, poor lighting conditions, etc. Although humans handle such complicated environments well, they pose challenges for machine learning systems that must identify and describe a target object without ambiguity. Most previous research focuses on mining discriminating features of the target object within the same category. On the other hand, as the scene becomes more complicated, humans frequently use neighboring objects as complementary information to describe the target one. Motivated by this, we propose a novel Complementary Neighboring-based Attention Network (CoNAN) that explicitly utilizes the visual differences between the target object and its highly-related neighbors. These highly-related neighbors, determined by an attentional ranking module, serve as complementary features that highlight the discriminating aspects of the target object. The speaker module then takes the visual difference features as an additional input to generate the expression. Our qualitative and quantitative results on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our generated expressions outperform those of other state-of-the-art models by a clear margin.
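A rough sketch of the neighbor-ranking idea is below, assuming simple dot-product attention and top-k selection; the function name, shapes, and the way difference features are formed are assumptions for illustration, not the CoNAN architecture itself.

```python
# Rough sketch of attention-based neighbor ranking and visual difference
# features; shapes and names are assumptions, not the CoNAN code.
import torch
import torch.nn.functional as F

def neighbor_difference_features(target_feat, neighbor_feats, top_k=3):
    """target_feat: (d,); neighbor_feats: (n, d). Returns (k, d) differences."""
    # Attention scores: similarity between the target and each neighbor.
    scores = neighbor_feats @ target_feat                    # (n,)
    weights = F.softmax(scores, dim=0)
    # Keep the top-k most related neighbors as complementary context.
    k = min(top_k, neighbor_feats.size(0))
    top_idx = weights.topk(k).indices
    selected = neighbor_feats[top_idx]                       # (k, d)
    # Visual difference features, fed to the speaker alongside target_feat.
    return target_feat.unsqueeze(0) - selected

diffs = neighbor_difference_features(torch.randn(16), torch.randn(6, 16))
```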
AI systems’ ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods using both automated metrics and human evaluation.
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach that exploits this connection to improve VQA performance by jointly generating captions targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions with an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g., 68.4% on the test-standard set with a single model) while simultaneously generating question-relevant captions.
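One plausible reading of the online gradient-based caption selection, sketched below: a candidate caption is treated as question-relevant when the gradient of its captioning loss with respect to the shared image features aligns with the gradient of the VQA loss. Every name and the exact scoring rule here are assumptions for illustration, not the paper's released code.

```python
# Hedged sketch of gradient-based caption selection: a caption counts as
# question-relevant when its loss gradient (w.r.t. shared image features)
# aligns with the VQA loss gradient. Names and the scoring rule are
# assumptions for illustration only.
import torch
import torch.nn.functional as F

def caption_relevance(image_feats, vqa_loss_fn, caption_loss_fns):
    """image_feats: (d,) with requires_grad=True; loss fns return scalars."""
    vqa_grad = torch.autograd.grad(vqa_loss_fn(image_feats), image_feats,
                                   retain_graph=True)[0]
    scores = []
    for cap_loss_fn in caption_loss_fns:
        cap_grad = torch.autograd.grad(cap_loss_fn(image_feats), image_feats,
                                       retain_graph=True)[0]
        scores.append(F.cosine_similarity(vqa_grad, cap_grad, dim=0))
    return torch.stack(scores)  # one relevance score per candidate caption

# Toy usage with linear "heads" standing in for the real VQA/captioning models.
feats = torch.randn(8, requires_grad=True)
vqa_w, cap_ws = torch.randn(8), [torch.randn(8) for _ in range(3)]
rel = caption_relevance(feats,
                        lambda f: (f @ vqa_w) ** 2,
                        [lambda f, w=w: (f @ w) ** 2 for w in cap_ws])
```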