Proceedings of the Third International Workshop on Spatial Language Understanding
In recent years, previous studies have used visual information in named entity recognition (NER) for social media posts with attached images. However, these methods can only be applied to documents with attached images. In this paper, we propose a NER method that can use element-wise visual information for any documents by using image data corresponding to each word in the document. The proposed method obtains element-wise image data using an image retrieval engine, to be used as extra features in the neural NER model. Experimental results on the standard Japanese NER dataset show that the proposed method achieves a higher F1 value (89.67%) than a baseline method, demonstrating the effectiveness of using element-wise visual information.
Spatial information extraction is essential to understand geographical information in text. This task is largely divided to two subtasks: spatial element extraction and spatial relation extraction. In this paper, we utilize BERT (Devlin et al., 2018), which is very effective for many natural language processing applications. We propose a BERT-based spatial information extraction model, which uses BERT for spatial element extraction and R-BERT (Wu and He, 2019) for spatial relation extraction. The model was evaluated with the SemEval 2015 dataset. The result showed a 15.4% point increase in spatial element extraction and an 8.2% point increase in spatial relation extraction in comparison to the baseline model (Nichols and Botros, 2015).
Automatic extraction of spatial information from natural language can boost human-centered applications that rely on spatial dynamics. The field of cognitive linguistics has provided theories and cognitive models to address this task. Yet, existing solutions tend to focus on specific word classes, subject areas, or machine learning techniques that cannot provide cognitively plausible explanations for their decisions. We propose an automated spatial semantic analysis (ASSA) framework building on grammar and cognitive linguistic theories to identify spatial entities and relations, bringing together methods of spatial information extraction and cognitive frameworks on spatial language. The proposed rule-based and explainable approach contributes constructions and preposition schemas and outperforms previous solutions on the CLEF-2017 standard dataset.
They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
Alberto Testoni | Claudio Greco | Tobias Bianchi | Mauricio Mazuecos | Agata Marcante | Luciana Benotti | Raffaella Bernardi
In this paper, we study the grounding skills required to answer spatial questions asked by humans while playing the GuessWhat?! game. We propose a classification for spatial questions dividing them into absolute, relational, and group questions. We build a new answerer model based on the LXMERT multimodal transformer and we compare a baseline with and without visual features of the scene. We are interested in studying how the attention mechanisms of LXMERT are used to answer spatial questions since they require putting attention on more than one region simultaneously and spotting the relation holding among them. We show that our proposed model outperforms the baseline by a large extent (9.70% on spatial questions and 6.27% overall). By analyzing LXMERT errors and its attention mechanisms, we find that our classification helps to gain a better understanding of the skills required to answer different spatial questions.
Various accounts of cognition and semantic representations have highlighted that, for some concepts, different factors may influence category and typicality judgements. In particular, some features may be more salient in categorisation tasks while other features are more salient when assessing typicality. In this paper we explore the extent to which this is the case for English spatial prepositions and discuss the implications for pragmatic strategies and semantic models. We hypothesise that object-specific features — related to object properties and affordances — are more salient in categorisation, while geometric and physical relationships between objects are more salient in typicality judgements. In order to test this hypothesis we conducted a study using virtual environments to collect both category and typicality judgements in 3D scenes. Based on the collected data we cannot verify the hypothesis and conclude that object-specific features appear to be salient in both category and typicality judgements, further evidencing the need to include these types of features in semantic models.
Radiology reports contain important clinical information about patients which are often tied through spatial expressions. Spatial expressions (or triggers) are mainly used to describe the positioning of radiographic findings or medical devices with respect to some anatomical structures. As the expressions result from the mental visualization of the radiologist’s interpretations, they are varied and complex. The focus of this work is to automatically identify the spatial expression terms from three different radiology sub-domains. We propose a hybrid deep learning-based NLP method that includes – 1) generating a set of candidate spatial triggers by exact match with the known trigger terms from the training data, 2) applying domain-specific constraints to filter the candidate triggers, and 3) utilizing a BERT-based classifier to predict whether a candidate trigger is a true spatial trigger or not. The results are promising, with an improvement of 24 points in the average F1 measure compared to a standard BERT-based sequence labeler.
The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in (Chen et al., 2019) and show that the panoramas we have added to StreetLearn support both Touchdown tasks and can be used effectively for further research and comparison.