In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.
Supervised machine learning algorithms require training data whose generation for complex relation extraction tasks tends to be difficult. Being optimized for relation extraction at sentence level, many annotation tools lack in facilitating the annotation of relational structures that are widely spread across the text. This leads to non-intuitive and cumbersome visualizations, making the annotation process unnecessarily time-consuming. We propose SANTO, an easy-to-use, domain-adaptive annotation tool specialized for complex slot filling tasks which may involve problems of cardinality and referential grounding. The web-based architecture enables fast and clearly structured annotation for multiple users in parallel. Relational structures are formulated as templates following the conceptualization of an underlying ontology. Further, import and export procedures of standard formats enable interoperability with external sources and tools.
We propose to study the evolution of concepts by learning to complete diachronic analogies between lists of terms which relate to the same concept at different points in time. We present a number of models based on operations on word embedddings that correspond to different assumptions about the characteristics of diachronic analogies and change in concept vocabularies. These are tested in a quantitative evaluation for nine different concepts on a corpus of Dutch newspapers from the 1950s and 1980s. We show that a model which treats the concept terms as analogous and learns weights to compensate for diachronic changes (weighted linear combination) is able to more accurately predict the missing term than a learned transformation and two baselines for most of the evaluated concepts. We also find that all models tend to be coherent in relation to the represented concept, but less discriminative in regard to other concepts. Additionally, we evaluate the effect of aligning the time-specific embedding spaces using orthogonal Procrustes, finding varying effects on performance, depending on the model, concept and evaluation metric. For the weighted linear combination, however, results improve with alignment in a majority of cases. All related code is released publicly.
Word embeddings have been shown to be highly effective in a variety of lexical semantic tasks. They tend to capture meaningful relational similarities between individual words, at the expense of lacking the capabilty of making the underlying semantic relation explicit. In this paper, we investigate the attribute relation that often holds between the constituents of adjective-noun phrases. We use CBOW word embeddings to represent word meaning and learn a compositionality function that combines the individual constituents into a phrase representation, thus capturing the compositional attribute meaning. The resulting embedding model, while being fully interpretable, outperforms count-based distributional vector space models that are tailored to attribute meaning in the two tasks of attribute selection and phrase similarity prediction. Moreover, as the model captures a generalized layer of attribute meaning, it bears the potential to be used for predictions over various attribute inventories without re-training.
Social media are used by an increasing number of political actors. A small subset of these is interested in pursuing extremist motives such as mobilization, recruiting or radicalization activities. In order to counteract these trends, online providers and state institutions reinforce their monitoring efforts, mostly relying on manual workflows. We propose a machine learning approach to support manual attempts towards identifying right-wing extremist content in German Twitter profiles. Based on a fine-grained conceptualization of right-wing extremism, we frame the task as ranking each individual profile on a continuum spanning different degrees of right-wing extremism, based on a nearest neighbour approach. A quantitative evaluation reveals that our ranking model yields robust performance (up to 0.81 F1 score) when being used for predicting discrete class labels. At the same time, the model provides plausible continuous ranking scores for a small sample of borderline cases at the division of right-wing extremism and New Right political movements.
We present a semi-supervised machine-learning approach for the classification of adjectives into property- vs. relation-denoting adjectives, a distinction that is highly relevant for ontology learning. The feasibility of this classification task is evaluated in a human annotation experiment. We observe that token-level annotation of these classes is expensive and difficult. Yet, a careful corpus analysis reveals that adjective classes tend to be stable, with few occurrences of class shifts observed at the token level. As a consequence, we opt for a type-based semi-supervised classification approach. The class labels obtained from manual annotation are projected to large amounts of unannotated token samples. Training on heuristically labeled data yields high classification performance on our own data and on a data set compiled from WordNet. Our results suggest that it is feasible to automatically distinguish adjectives denoting properties and relations, using small amounts of annotated data.
In this paper, we present HeiNER, the multilingual Heidelberg Named Entity Resource. HeiNER contains 1,547,586 disambiguated English Named Entities together with translations and transliterations to 15 languages. Our work builds on the approach described in (Bunescu and Pasca, 2006), yet extends it to a multilingual dimension. Translating Named Entities into the various target languages is carried out by exploiting crosslingual information contained in the online encyclopedia Wikipedia. In addition, HeiNER provides linguistic contexts for every NE in all target languages which makes it a valuable resource for multilingual Named Entity Recognition, Disambiguation and Classification. The results of our evaluation against the assessments of human annotators yield a high precision of 0.95 for the NEs we extract from the English Wikipedia. These source language NEs are thus very reliable seeds for our multilingual NE translation method.
Recent work has aimed at discovering ontological relations from text corpora. Most approaches are based on the assumption that verbs typically indicate semantic relations between concepts. However, the problem of finding the appropriate generalization level for the verb's arguments with respect to a given taxonomy has not received much attention in the ontology learning community. In this paper, we address the issue of determining the appropriate level of abstraction for binary relations extracted from a corpus with respect to a given concept hierarchy. For this purpose, we reuse techniques from the subcategorization and selectional restrictions acquisition communities. The contribution of our work lies in the systematic analysis of three different measures. We conduct our experiments on the Genia corpus and the Genia ontology and evaluate the different measures by comparing the results of our approach with a gold standard provided by one of the authors, a biologist.