Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources
Word embeddings such as Word2Vec not only uniquely identify words but also encode important semantic information about them. However, as single entities they are difficult to interpret and their individual dimensions do not have obvious meanings. A more intuitive and interpretable feature space based on neural representations of words was presented by Binder and colleagues (2016) but is only available for a very limited vocabulary. Previous research (Utsumi, 2018) indicates that Binder features can be predicted for words from their embedding vectors (such as Word2Vec), but only looked at the original Binder vocabulary. This paper aimed to demonstrate that Binder features can effectively be predicted for a large number of new words and that the predicted values are sensible. The results supported this, showing that correlations between predicted feature values were consistent with those in the original Binder dataset. Additionally, vectors of predicted values performed comparatively to established embedding models in tests of word-pair semantic similarity. Being able to predict Binder feature space vectors for any number of new words opens up many uses not possible with the original vocabulary size.
Texts comprise a large part of visual information that we process every day, so one of the tasks of language science is to make them more accessible. However, often the text design process is focused on the font size, but not on its type; which might be crucial especially for the people with reading disabilities. The current paper represents a study on text accessibility and the first attempt to create a research-based accessible font for Cyrillic letters. This resulted in the dyslexic-specific font, LexiaD. Its design rests on the reduction of inter-letter similarity of the Russian alphabet. In evaluation stage, dyslexic and non-dyslexic children were asked to read sentences from the Children version of the Russian Sentence Corpus. We tested the readability of LexiaD compared to PT Sans and PT Serif fonts. The results showed that all children had some advantage in letter feature extraction and information integration while reading in LexiaD, but lexical access was improved when sentences were rendered in PT Sans or PT Serif. Therefore, in several aspects, LexiaD proved to be faster to read and could be recommended to use by dyslexics who have visual deficiency or those who struggle with text understanding resulting in re-reading.
NLP models are imperfect and lack intricate capabilities that humans access automatically when processing speech or reading a text. Human language processing data can be leveraged to increase the performance of models and to pursue explanatory research for a better understanding of the differences between human and machine language processing. We review recent studies leveraging different types of cognitive processing signals, namely eye-tracking, M/EEG and fMRI data recorded during language understanding. We discuss the role of cognitive data for machine learning-based NLP methods and identify fundamental challenges for processing pipelines. Finally, we propose practical strategies for using these types of cognitive signals to enhance NLP models.
Linguistics predictability is the degree of confidence in which language unit (word, part of speech, etc.) will be the next in the sequence. Experiments have shown that the correct prediction simplifies the perception of a language unit and its integration into the context. As a result of an incorrect prediction, language processing slows down. Currently, to get a measure of the language unit predictability, a neurolinguistic experiment known as a cloze task has to be conducted on a large number of participants. Cloze tasks are resource-consuming and are criticized by some researchers as an insufficiently valid measure of predictability. In this paper, we compare different language models that attempt to simulate human respondents’ performance on the cloze task. Using a language model to create cloze task simulations would require significantly less time and conduct studies related to linguistic predictability.
While there is a rich literature on the tracking of sentiment and emotion in texts, modelling the emotional trajectory of longer narratives, such as literary texts, poses new challenges. Previous work in the area of sentiment analysis has focused on using information from within a sentence to predict a valence value for that sentence. We propose to explore the influence of previous sentences on the sentiment of a given sentence. In particular, we investigate whether information present in a history of previous sentences can be used to predict a valence value for the following sentence. We explored both linear and non-linear models applied with a range of different feature combinations. We also looked at different context history sizes to determine what range of previous sentence context was the most informative for our models. We establish a linear relationship between sentence context history and the valence value of the current sentence and demonstrate that sentences in closer proximity to the target sentence are more informative. We show that the inclusion of semantic word embeddings further enriches our model predictions.
We present the Le Petit Prince Corpus (LPPC), a multi-lingual resource for research in (computational) psycho- and neurolinguistics. The corpus consists of the children’s story The Little Prince in 26 languages. The dataset is in the process of being built using state-of-the-art methods for speech and language processing and electroencephalography (EEG). The planned release of LPPC dataset will include raw text annotated with dependency graphs in the Universal Dependencies standard, a near-natural-sounding synthetic spoken subset as well as EEG recordings. We will use this corpus for conducting neurolinguistic studies that generalize across a wide range of languages, overcoming typological constraints to traditional approaches. The planned release of the LPPC combines linguistic and EEG data for many languages using fully automatic methods, and thus constitutes a readily extendable resource that supports cross-linguistic and cross-disciplinary research.
In sentiment analysis, several researchers have used emoji and hashtags as specific forms of training and supervision. Some emotions, such as fear and disgust, are underrepresented in the text of social media. Others, such as anticipation, are absent. This research paper proposes a new dataset for complex emotion detection using a combination of several existing corpora in order to represent and interpret complex emotions based on the Plutchik’s theory. Our experiments and evaluations confirm that using Transfer Learning (TL) with a rich emotional corpus, facilitates the detection of complex emotions in a four-dimensional space. In addition, the incorporation of the rule on the reverse emotions in the model’s architecture brings a significant improvement in terms of precision, recall, and F-score.
Embodied cognitive science suggested a number of variables describing our sensorimotor experience associated with different concepts: modality experience rating (i.e., relationship between words and images of a particular perceptive modality—visual, auditory, haptic etc.), manipulability (the necessity for an object to interact with human hands in order to perform its function), vertical spatial localization. According to the embodied cognition theory, these semantic variables capture our mental representations and thus should influence word learning, processing and production. However, it is not clear how these new variables are related to such traditional variables as imageability, age of acquisition (AoA) and word frequency. In the presented database, normative data on the modality (visual, auditory, haptic, olfactory, and gustatory) ratings, vertical spatial localization of the object, manipulability, imageability, age of acquisition, and subjective frequency for 506 Russian nouns are collected. Factor analysis revealed four factors: (1) visual and haptic modality ratings were combined with imageability, manipulability and AoA; (2) word length, frequency and AoA; (3) olfactory modality was united with gustatory; (4) spatial localization only was included in the fourth factor. The database is available online together with a publication describing the method of data collection and data parameters (Miklashevsky, 2018).