Event Detection (ED) is a critical task that aims to identify events of certain types in plain text. Neural models have achieved great success on ED, thus coming with a desire for higher interpretability. Existing works mainly exploit words or phrases of the input text to explain models’ inner mechanisms. However, for ED, the event structure, comprising of an event trigger and a set of arguments, are more enlightening clues to explain model behaviors. To this end, we propose a Trigger-Argument based Explanation method (TAE), which can utilize event structure knowledge to uncover a faithful interpretation for the existing ED models at neuron level. Specifically, we design group, sparsity, support mechanisms to construct the event structure from structuralization, compactness, and faithfulness perspectives. We evaluate our model on the large-scale MAVEN and the widely-used ACE 2005 datasets, and observe that TAE is able to reveal the process by which the model predicts. Experimental results also demonstrate that TAE can not only improve the interpretability on standard evaluation metrics, but also effectively facilitate the human understanding.
ConceptBert: Concept-Aware Representation for Visual Question Answering
François Gardères | Maryam Ziaeefard | Baptiste Abeloos | Freddy Lecue
Findings of the Association for Computational Linguistics: EMNLP 2020
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. A VQA model combines visual and textual features in order to answer questions grounded in an image. Current works in VQA focus on questions which are answerable by direct analysis of the question and image alone. We present a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content. Given an image and a question in natural language, ConceptBert requires visual elements of the image and a Knowledge Graph (KG) to infer the correct answer. We introduce a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture. We exploit ConceptNet KG for encoding the common sense knowledge and evaluate our methodology on the Outside Knowledge-VQA (OK-VQA) and VQA datasets.
Visual Question Answering (VQA) remains algorithmically challenging while it is effortless for humans. Humans combine visual observations with general and commonsense knowledge to answer questions about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our model is a graph-based approach that combines scene graphs with concept graphs, which learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to set higher importance to key knowledge instances that are mostly relevant to each question. We exploit ConceptNet as the source of general knowledge and evaluate the performance of our model on the challenging OK-VQA dataset.