Olfaction is a rather understudied sense compared to the other senses. In NLP, however, there have been recent attempts to develop taxonomies and benchmarks specifically designed to capture smell-related information. In this work, we further extend this research line by presenting a supervised system for olfactory information extraction in English. We cast this problem as a token classification task and build a system that identifies smell words, smell sources and qualities. The classifier is then applied to a set of English historical corpora, covering different domains and written in a time period between the 15th and the 20th Century. A qualitative analysis of the extracted data shows that they can be used to infer interesting information about smelly items such as tea and tobacco from a diachronical perspective, supporting historical investigation with corpus-based evidence.
We present a benchmark in six European languages containing manually annotated information about olfactory situations and events following a FrameNet-like approach. The documents selection covers ten domains of interest to cultural historians in the olfactory domain and includes texts published between 1620 to 1920, allowing a diachronic analysis of smell descriptions. With this work, we aim to foster the development of olfactory information extraction approaches as well as the analysis of changes in smell descriptions over time.
In this paper we present KIND, an Italian dataset for Named-entity recognition. It contains more than one million tokens with annotation covering three classes: person, location, and organization. The dataset (around 600K tokens) mostly contains manual gold annotations in three different domains (news, literature, and political discourses) and a semi-automatically annotated part. The multi-domain feature is the main strength of the present work, offering a resource which covers different styles and language uses, as well as the largest Italian NER dataset with manual gold annotations. It represents an important resource for the training of NER systems in Italian. Texts and annotations are freely downloadable from the Github repository.
Olfactory references play a crucial role in our memory and, more generally, in our experiences, since researchers have shown that smell is the sense that is most directly connected with emotions. Nevertheless, only few works in NLP have tried to capture this sensory dimension from a computational perspective. One of the main challenges is the lack of a systematic and consistent taxonomy of olfactory information, where concepts are organised also in a multi-lingual perspective. WordNet represents a valuable starting point in this direction, which can be semi-automatically extended taking advantage of Google n-grams and of existing language models. In this work we describe the process that has led to the semi-automatic development of a taxonomy for olfactory information in four languages (English, French, German and Italian), detailing the different steps and the intermediate evaluations. Along with being multi-lingual, the taxonomy also encloses temporal marks for olfactory terms thus making it a valuable resource for historical content analysis. The resource has been released and is freely available.